├── lm ├── __init__.py ├── configs │ ├── base.json │ ├── mega.json │ └── large.json ├── README.md ├── validate.sh ├── train_tpu_adafactor.sh ├── train.py ├── dataloader.py ├── validate.py ├── utils.py └── optimization_adafactor.py ├── sample ├── __init__.py ├── README.md ├── contextual_generate.py ├── encoder.py └── april2019_set_mini.jsonl ├── requirements-gpu.txt ├── requirements-tpu.txt ├── realnews ├── process_ccrawl.sh ├── prepare_lm_data.sh ├── README.md ├── cc_files.txt ├── prepare_lm_data.py ├── process_ccrawl.py ├── dedupe_crawl.py └── realnews_tiny.jsonl ├── download_model.py ├── Dockerfile ├── generation_examples ├── README.md └── compute_accuracy_script.py ├── .gitignore ├── discrimination ├── README.md └── run_discrimination.py ├── README.md └── LICENSE /lm/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /sample/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /requirements-gpu.txt: -------------------------------------------------------------------------------- 1 | pandas==0.24.2 2 | regex==2019.4.14 3 | h5py==2.9.0 4 | numpy==1.16.2 5 | tensorboard==1.13.1 6 | tensorflow-gpu==1.13.1 7 | tqdm==4.31.1 8 | requests==2.22.0 -------------------------------------------------------------------------------- /requirements-tpu.txt: -------------------------------------------------------------------------------- 1 | pandas==0.24.2 2 | regex==2019.4.14 3 | h5py==2.9.0 4 | numpy==1.16.2 5 | tensorboard==1.13.1 6 | tensorflow==1.13.1 7 | tensorflow-estimator==1.13.0 8 | tqdm==4.31.1 9 | requests==2.22.0 -------------------------------------------------------------------------------- /sample/README.md: -------------------------------------------------------------------------------- 1 | # sample 2 | 3 | This contains code for sampling, generating, etc. from the language model 4 | 5 | You can use the file `contextual_generate.py` which takes metadata and produces an article. 
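If you want to build your own metadata file, here's a minimal sketch of one JSONL record. The field names mirror the keys `contextual_generate.py` reads (`domain`, `date`, `authors`, `title`, `article`); the concrete values below are placeholders, not something shipped with the repo.

```
import json

# One metadata record per line. All values here are placeholders.
record = {
    "domain": "example.com",
    "date": "April 19, 2019",
    "authors": ["Jane Doe"],
    "title": "An example headline to condition on",
    "article": "",  # with the default -target article, the body is what gets generated
}

with open("my_metadata.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```

Then point `-metadata_fn` at `my_metadata.jsonl` when running `contextual_generate.py`.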
-------------------------------------------------------------------------------- /realnews/process_ccrawl.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | DUMP2CRAWL=$1 4 | aws s3 cp "s3://commoncrawl/crawl-data/${DUMP2CRAWL}/warc.paths.gz" ~/temp/ 5 | rm -f ~/temp/warc.paths 6 | gunzip ~/temp/warc.paths.gz 7 | 8 | parallel -j $(nproc --all) --will-cite python process_ccrawl.py -path "{1}" ">" "~/logs/{%}.txt" < ~/temp/warc.paths 9 | -------------------------------------------------------------------------------- /lm/configs/base.json: -------------------------------------------------------------------------------- 1 | { 2 | "vocab_size": 50270, 3 | "hidden_size": 768, 4 | "attention_probs_dropout_prob": 0.1, 5 | "hidden_dropout_prob": 0.1, 6 | "hidden_act": "gelu", 7 | "initializer_range": 0.02, 8 | "intermediate_size": 3072, 9 | "max_position_embeddings": 1024, 10 | "num_attention_heads": 12, 11 | "num_hidden_layers": 12 12 | } -------------------------------------------------------------------------------- /lm/configs/mega.json: -------------------------------------------------------------------------------- 1 | { 2 | "vocab_size": 50270, 3 | "hidden_size": 1536, 4 | "attention_probs_dropout_prob": 0.1, 5 | "hidden_dropout_prob": 0.1, 6 | "hidden_act": "gelu", 7 | "initializer_range": 0.014142135623731, 8 | "intermediate_size": 6144, 9 | "max_position_embeddings": 1024, 10 | "num_attention_heads": 24, 11 | "num_hidden_layers": 48 12 | } -------------------------------------------------------------------------------- /lm/configs/large.json: -------------------------------------------------------------------------------- 1 | { 2 | "vocab_size": 50270, 3 | "hidden_size": 1024, 4 | "attention_probs_dropout_prob": 0.1, 5 | "hidden_dropout_prob": 0.1, 6 | "hidden_act": "gelu", 7 | "initializer_range": 0.02, 8 | "intermediate_size": 4096, 9 | "max_position_embeddings": 1024, 10 | "num_attention_heads": 16, 11 | "num_hidden_layers": 24, 12 | "max_batch_size_per_core": 3 13 | } -------------------------------------------------------------------------------- /realnews/prepare_lm_data.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | NUM_FOLDS=1024 4 | MAX_SEQ_LENGTH=1024 5 | FN=${1} 6 | OUT_BUCKET=${2} 7 | 8 | rm -rf logs_${MAX_SEQ_LENGTH} 9 | mkdir logs_${MAX_SEQ_LENGTH} 10 | parallel -j $(nproc --all) --will-cite "python prepare_data.py -fold {1} -num_folds ${NUM_FOLDS} -base_fn gs://${OUT_BUCKET}/data_${MAX_SEQ_LENGTH}/ -input_fn ${FN} -max_seq_length ${MAX_SEQ_LENGTH} > logs_${MAX_SEQ_LENGTH}/log{1}.txt" ::: $(seq 0 $((${NUM_FOLDS}-1))) 11 | -------------------------------------------------------------------------------- /lm/README.md: -------------------------------------------------------------------------------- 1 | # What everything does 2 | 3 | * `validate.py` gets perplexity. You can use the script `validate.sh` which contains some arguments I used. 4 | * `train.py` trains a grover model from scratch. The script `train_tpu_adafactor.sh` could also help. You probably don't want to do this unless you have a lot of money for TPUs. However, it might be a good idea to finetune Grover to a different domain. 
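If you just want to poke at the `preds.h5` file that `validate.py` writes out, the sketch below mirrors the perplexity computation at the bottom of `validate.py`; the dataset names (`gt_logprobs`, `labels`) and the 50265/50266 begin/end-article token ids are taken from that script.

```
import h5py
import numpy as np

# Load the per-token log probabilities and token ids saved by validate.py.
with h5py.File("preds.h5", "r") as h5:
    gt_logprobs = h5["gt_logprobs"][:].astype(np.float32)
    labels = h5["labels"][:]

per_article = []
for logprobs_i, ids_i in zip(gt_logprobs, labels):
    # Keep only the article body, delimited by the begin/end article tokens.
    starts = np.where(ids_i == 50265)[0]
    ends = np.where(ids_i == 50266)[0]
    start_ind = int(starts[0]) if len(starts) > 0 else 0
    end_ind = int(ends[0]) if len(ends) > 0 else ids_i.shape[0] - 1
    per_article.append(logprobs_i[start_ind:end_ind])

all_logprobs = np.concatenate(per_article, 0)
print("Article perplexity is {:.3f}".format(np.exp(-np.mean(all_logprobs))))
```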
5 | 6 | # Setting up tensorboard 7 | 8 | During training, you can use tensorboard by running the following commands: 9 | 10 | ``` 11 | ssh -L 6006:localhost:6006 myservername 12 | tensorboard --logdir="grover":"gs://MYOUTPUTPATH" --port=6006 13 | ``` -------------------------------------------------------------------------------- /lm/validate.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | export PYTHONPATH=/home/rowanz/code/fakenewslm 4 | 5 | max_seq_length=1024 6 | num_tpu_cores=8 7 | batch_size_per_core=2 8 | model_type=$1 9 | input_file="" 10 | OUTPUT_DIR="" 11 | init_checkpoint="" 12 | model_type="" 13 | 14 | let batch_size="$batch_size_per_core * $num_tpu_cores" 15 | 16 | 17 | python validate.py \ 18 | --config_file=configs/${model_type}.json \ 19 | --input_file=${input_file} \ 20 | --output_dir=${OUTPUT_DIR} \ 21 | --max_seq_length=${max_seq_length} \ 22 | --batch_size=${batch_size} \ 23 | --use_tpu=True \ 24 | --tpu_name=$(hostname) \ 25 | --num_tpu_cores=$num_tpu_cores \ 26 | --init_checkpoint=${init_checkpoint} \ 27 | --validation_name="preds.h5" -------------------------------------------------------------------------------- /download_model.py: -------------------------------------------------------------------------------- 1 | # This is just for downloading the generator. See `discrimination/` for the discrimination checkpoints 2 | 3 | import os 4 | import requests 5 | import argparse 6 | 7 | parser = argparse.ArgumentParser(description='Download a model!') 8 | parser.add_argument( 9 | 'model_type', 10 | type=str, 11 | help='Valid model names: (base|large|mega)', 12 | ) 13 | model_type = parser.parse_args().model_type 14 | 15 | model_dir = os.path.join('models', model_type) 16 | if not os.path.exists(model_dir): 17 | os.makedirs(model_dir) 18 | 19 | for ext in ['data-00000-of-00001', 'index', 'meta']: 20 | r = requests.get(f'https://storage.googleapis.com/grover-models/{model_type}/model.ckpt.{ext}', stream=True) 21 | with open(os.path.join(model_dir, f'model.ckpt.{ext}'), 'wb') as f: 22 | file_size = int(r.headers["content-length"]) 23 | if file_size < 1000: 24 | raise ValueError("File doesn't exist? 
idk") 25 | chunk_size = 1000 26 | for chunk in r.iter_content(chunk_size=chunk_size): 27 | f.write(chunk) 28 | print(f"Just downloaded {model_type}/model.ckpt.{ext}!", flush=True) 29 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04 2 | 3 | RUN apt-get update && apt-get install -y --no-install-recommends \ 4 | build-essential \ 5 | cmake \ 6 | git \ 7 | curl \ 8 | vim \ 9 | ca-certificates \ 10 | libjpeg-dev \ 11 | libpng-dev &&\ 12 | rm -rf /var/lib/apt/lists/* 13 | 14 | RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh && \ 15 | chmod +x ~/miniconda.sh && \ 16 | ~/miniconda.sh -b -p /opt/conda && \ 17 | rm ~/miniconda.sh && \ 18 | /opt/conda/bin/conda install -y python=3.6 tqdm numpy pyyaml scipy ipython mkl mkl-include cython typing h5py pandas && \ 19 | /opt/conda/bin/conda clean -ya && /opt/conda/bin/pip install tensorflow-gpu==1.13.1 20 | 21 | ENV LC_ALL=C.UTF-8 22 | ENV LANG=C.UTF-8 23 | ENV PATH /opt/conda/bin:/usr/local/nvidia/bin/:$PATH 24 | ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64 25 | ENV NVIDIA_VISIBLE_DEVICES all 26 | ENV NVIDIA_DRIVER_CAPABILITIES compute,utility 27 | 28 | WORKDIR /grover 29 | 30 | COPY requirements-gpu.txt . 31 | RUN pip install -r requirements-gpu.txt 32 | 33 | ENV PYTHONPATH /grover 34 | 35 | ADD . . 36 | 37 | CMD ["/bin/bash"] 38 | -------------------------------------------------------------------------------- /generation_examples/README.md: -------------------------------------------------------------------------------- 1 | # Data for Grover 2 | 3 | This folder contains some generation examples from Grover. You can get everything by running 4 | 5 | `gsutil cp -r "gs://grover-models/generation_examples/*" .` 6 | 7 | Alternatively, run 8 | ``` 9 | wget https://storage.googleapis.com/grover-models/generation_examples/generator=mega~dataset=p0.94.jsonl 10 | wget https://storage.googleapis.com/grover-models/generation_examples/generator=mega~discriminator=grover~discsize=mega~dataset=p0.94~test-probs.npy 11 | wget https://storage.googleapis.com/grover-models/generation_examples/generator=mega~discriminator=grover~discsize=mega~dataset=p0.94~val-probs.npy 12 | ``` 13 | 14 | This downloads a dataset of news articles from April 2019, along with Grover-Mega generations. I used this setup to measure the Grover discrimination accuracy. These are just from using Nucleus Sampling `p=0.94` because that was found (from a grid search) to be the hardest to detect when Grover-Mega is both the generator as well as the discriminator. I did this separately for each discriminator-generator combo. 15 | 16 | It also downloads the predicted machine/human probabilities given by Grover-Mega as a discriminator. 17 | 18 | To compute the accuracy in the way I did for the paper, use [compute_accuracy_script.py]. -------------------------------------------------------------------------------- /realnews/README.md: -------------------------------------------------------------------------------- 1 | # Downloading the existing RealNews dataset 2 | 3 | A tiny version is available for debugging purposes (`realnews_tiny.jsonl`). You can download the full version, just please [submit this form](https://docs.google.com/forms/d/1LMAUeUtHNPXO9koyAIlDpvyKsLSYlrBj3rYhC30a7Ak). 
4 | 5 | # Code for scraping the realnews dataset 6 | 7 | You probably don't want to create your own realnews dataset. But for reproducibility purposes here's what I did: 8 | 9 | Setup: 10 | * You need to spawn off an AWS machine in `us-east-1`. That's where [common crawl](https://registry.opendata.aws/commoncrawl/) is located. My recommendation is to get a machine with as many CPUs as possible. We used roughly 15 machines, each with 72 CPUs. Thankfully, common crawl is broken up into many pieces so the work can be easily distributed amongst these machines 11 | * Make a new `s3` bucket, which also needs to be in the `us-east-1` region. 12 | 13 | 14 | Now, let's get started: 15 | * Use any of the ids in cc_files (like the last one, `CC-MAIN-2019-13` which is for March 2019). 16 | * Then run `process_ccrawl.sh CC-MAIN-2019-13`, and this will crawl that in parallel using all of your CPUs. You will probably need to change the arguments in `process_ccrawl.py` so it goes to your bucket. 17 | * Do this a lot, and now you probably need to deduplicate, so use `dedupe_crawl.py`. This can be done on 1 CPU. 18 | * Last, you'll want to convert everything to tfrecords and move them to Google Cloud. You also probably want to do this in parallel. This can be done using `prepare_lm_data.sh` 19 | 20 | That's it! -------------------------------------------------------------------------------- /lm/train_tpu_adafactor.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | export PYTHONPATH=/home/rowanz/code/fakenewslm 4 | 5 | learning_rate=1e-4 6 | init_checkpoint="" 7 | max_seq_length=1024 8 | save_checkpoint_steps=1000 9 | 10 | # You can customize the training here 11 | # mega, medium, or base 12 | model_type="base" 13 | OUTPUT_DIR="gs://" # put your output directory here 14 | input_file="gs://" # put your input files here, it can also be something like "*.tfrecord" 15 | 16 | if [ ${model_type} == "base" ]; then 17 | num_tpu_cores=32 18 | batch_size_per_core=16 19 | elif [ ${model_type} == "medium" ]; then 20 | num_tpu_cores=128 21 | batch_size_per_core=4 22 | elif [ ${model_type} == "mega" ]; then 23 | num_tpu_cores=256 24 | batch_size_per_core=2 25 | fi 26 | 27 | 28 | # there are 20k * 1024 examples so this translates to 20 epochs. seems ok and i can run for more if needed 29 | num_train_steps=800000 30 | 31 | # Make sure batch size scales. 
32 | let batch_size="$batch_size_per_core * $num_tpu_cores" 33 | 34 | python train.py \ 35 | --config_file=configs/${model_type}.json \ 36 | --input_file=${input_file} \ 37 | --output_dir=${OUTPUT_DIR} \ 38 | --max_seq_length=${max_seq_length} \ 39 | --train_batch_size=${batch_size} \ 40 | --learning_rate=${learning_rate} \ 41 | --num_train_steps=${num_train_steps} \ 42 | --num_warmup_steps=10000 \ 43 | --save_checkpoints_steps=${save_checkpoint_steps} \ 44 | --iterations_per_loop=${save_checkpoint_steps} \ 45 | --use_tpu=True \ 46 | --tpu_name=$(hostname) \ 47 | --num_tpu_cores=$num_tpu_cores \ 48 | --init_checkpoint=${init_checkpoint} -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | -------------------------------------------------------------------------------- /discrimination/README.md: -------------------------------------------------------------------------------- 1 | # discrimination 2 | 3 | This folder contains code for the discrimination experiments. 4 | 5 | `run_discrimination.py` can be used to train or evaluate a model for discrimination 6 | 7 | # Discrimination checkpoints 8 | Here are links to the discrimination checkpoints. You'll need to use google cloud storage to download these. 9 | 10 | **NOTE**: These checkpoints were trained on 5000 examples from a specific Grover generator, with a specific nucleus sampling top-p setting. As a result, these aren't necessarily the best discrimination checkpoints, nor are they the most general. The reason we used this experimental setup is outlined [in the paper](https://arxiv.org/abs/1905.12616) -- we assumed limited access to the generator. 
We did [later experiments](https://medium.com/ai2-blog/counteracting-neural-disinformation-with-grover-6cf6690d463b) and found that if you assume, say, 100k examples from a generator, you'll do much better (up to around 97% accuracy). 11 | 12 | In other words, if you want to mimic my experimental setup, but with your own generator, you'd also need to train your own discriminator from scratch. Alternatively, if you want a really good discriminator against my checkpoints for whatever reason, you'd also probably want to train your own discriminator from scratch. 13 | 14 | Medium trained on medium, top-p=0.96: 15 | ``` 16 | gs://grover-models/discrimination/generator=medium~discriminator=grover~discsize=medium~dataset=p=0.96/model.ckpt-1562.data-00000-of-00001 17 | gs://grover-models/discrimination/generator=medium~discriminator=grover~discsize=medium~dataset=p=0.96/model.ckpt-1562.index 18 | gs://grover-models/discrimination/generator=medium~discriminator=grover~discsize=medium~dataset=p=0.96/model.ckpt-1562.meta 19 | ``` 20 | 21 | Mega trained on mega, top-p=0.94: 22 | ``` 23 | gs://grover-models/discrimination/generator=mega~discriminator=grover~discsize=mega~dataset=p=0.94/model.ckpt-1562.data-00000-of-00001 24 | gs://grover-models/discrimination/generator=mega~discriminator=grover~discsize=mega~dataset=p=0.94/model.ckpt-1562.index 25 | gs://grover-models/discrimination/generator=mega~discriminator=grover~discsize=mega~dataset=p=0.94/model.ckpt-1562.meta 26 | ``` -------------------------------------------------------------------------------- /realnews/cc_files.txt: -------------------------------------------------------------------------------- 1 | s3://commoncrawl/crawl-data/CC-MAIN-2013-20 2 | s3://commoncrawl/crawl-data/CC-MAIN-2013-48 3 | s3://commoncrawl/crawl-data/CC-MAIN-2014-10 4 | s3://commoncrawl/crawl-data/CC-MAIN-2014-15 5 | s3://commoncrawl/crawl-data/CC-MAIN-2014-23 6 | s3://commoncrawl/crawl-data/CC-MAIN-2014-35 7 | s3://commoncrawl/crawl-data/CC-MAIN-2014-41 8 | s3://commoncrawl/crawl-data/CC-MAIN-2014-42 9 | s3://commoncrawl/crawl-data/CC-MAIN-2014-49 10 | s3://commoncrawl/crawl-data/CC-MAIN-2014-52 11 | s3://commoncrawl/crawl-data/CC-MAIN-2015-06 12 | s3://commoncrawl/crawl-data/CC-MAIN-2015-11 13 | s3://commoncrawl/crawl-data/CC-MAIN-2015-14 14 | s3://commoncrawl/crawl-data/CC-MAIN-2015-18 15 | s3://commoncrawl/crawl-data/CC-MAIN-2015-22 16 | s3://commoncrawl/crawl-data/CC-MAIN-2015-27 17 | s3://commoncrawl/crawl-data/CC-MAIN-2015-32 18 | s3://commoncrawl/crawl-data/CC-MAIN-2015-35 19 | s3://commoncrawl/crawl-data/CC-MAIN-2015-40 20 | s3://commoncrawl/crawl-data/CC-MAIN-2015-48 21 | s3://commoncrawl/crawl-data/CC-MAIN-2016-07 22 | s3://commoncrawl/crawl-data/CC-MAIN-2016-18 23 | s3://commoncrawl/crawl-data/CC-MAIN-2016-22 24 | s3://commoncrawl/crawl-data/CC-MAIN-2016-26 25 | s3://commoncrawl/crawl-data/CC-MAIN-2016-30 26 | s3://commoncrawl/crawl-data/CC-MAIN-2016-36 27 | s3://commoncrawl/crawl-data/CC-MAIN-2016-40 28 | s3://commoncrawl/crawl-data/CC-MAIN-2016-44 29 | s3://commoncrawl/crawl-data/CC-MAIN-2016-50 30 | s3://commoncrawl/crawl-data/CC-MAIN-2017-04 31 | s3://commoncrawl/crawl-data/CC-MAIN-2017-09 32 | s3://commoncrawl/crawl-data/CC-MAIN-2017-13 33 | s3://commoncrawl/crawl-data/CC-MAIN-2017-17 34 | s3://commoncrawl/crawl-data/CC-MAIN-2017-22 35 | s3://commoncrawl/crawl-data/CC-MAIN-2017-26 36 | s3://commoncrawl/crawl-data/CC-MAIN-2017-30 37 | s3://commoncrawl/crawl-data/CC-MAIN-2017-34 38 | s3://commoncrawl/crawl-data/CC-MAIN-2017-39 39 | 
s3://commoncrawl/crawl-data/CC-MAIN-2017-43 40 | s3://commoncrawl/crawl-data/CC-MAIN-2017-47 41 | s3://commoncrawl/crawl-data/CC-MAIN-2017-51 42 | s3://commoncrawl/crawl-data/CC-MAIN-2018-05 43 | s3://commoncrawl/crawl-data/CC-MAIN-2018-09 44 | s3://commoncrawl/crawl-data/CC-MAIN-2018-13 45 | s3://commoncrawl/crawl-data/CC-MAIN-2018-17 46 | s3://commoncrawl/crawl-data/CC-MAIN-2018-22 47 | s3://commoncrawl/crawl-data/CC-MAIN-2018-26 48 | s3://commoncrawl/crawl-data/CC-MAIN-2018-30 49 | s3://commoncrawl/crawl-data/CC-MAIN-2018-34 50 | s3://commoncrawl/crawl-data/CC-MAIN-2018-39 51 | s3://commoncrawl/crawl-data/CC-MAIN-2018-43 52 | s3://commoncrawl/crawl-data/CC-MAIN-2018-47 53 | s3://commoncrawl/crawl-data/CC-MAIN-2018-51 54 | s3://commoncrawl/crawl-data/CC-MAIN-2019-04 55 | s3://commoncrawl/crawl-data/CC-MAIN-2019-09 56 | s3://commoncrawl/crawl-data/CC-MAIN-2019-13 57 | -------------------------------------------------------------------------------- /generation_examples/compute_accuracy_script.py: -------------------------------------------------------------------------------- 1 | """ 2 | You can use this script to compute the accuracy given a dataset like 3 | * generations_p=0.96.jsonl 4 | 5 | and also a numpy array of machine/human probabilities that is the same size as the val and test sets. 6 | """ 7 | import json 8 | import numpy as np 9 | import os 10 | import pandas as pd 11 | 12 | # Load in the dataset 13 | set_to_info = {'val': [], 'test': []} 14 | with open('generator=mega~dataset=p0.94.jsonl', 'r') as f: 15 | for x in f: 16 | item = json.loads(x) 17 | if item['split'] == 'train': 18 | continue 19 | set_to_info[item['split']].append(item) 20 | 21 | # Load in the probabilities 22 | 23 | SAVED_PROBS_PATH='generator=mega~discriminator=grover~discsize=mega~dataset=p0.94~test-probs.npy' 24 | assert os.path.exists(SAVED_PROBS_PATH) 25 | probs = np.load(SAVED_PROBS_PATH) 26 | 27 | ############################ OK NOW HERE'S WHERE IT GETS INTERESTING 28 | def score(probs, full_info): 29 | score_df = pd.DataFrame(data=probs, columns=['machine', 'human']) # THIS MUST AGREE 30 | score_df['labels'] = [x['label'] for x in full_info] 31 | score_df['orig_split'] = [x['orig_split'] for x in full_info] 32 | score_df['ind30k'] = [x['ind30k'] for x in full_info] 33 | score_df.index.name = 'raw_index' 34 | score_df.reset_index(inplace=True) 35 | 36 | # So really there are 3 groups here: 37 | # HUMAN WRITTEN ARTICLE (the "burner") 38 | # MACHINE WRITTEN ARTICLE PAIRED WITH HUMAN WRITTEN ARTICLE 39 | # For evaluation we want a 50:50 split between human and machine generations, meaning we need to take out the 40 | # burner part. 
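# (The third group -- human-written articles paired with the machine ones -- is handled below too:
#  those are the 'human' labels inside the 'gen' split, while the burner articles sit in the
#  'train_burner' split.)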
41 | groups = {k:v for k, v in score_df.groupby('orig_split')} 42 | unpaired_human = groups.pop('train_burner') 43 | 44 | machine_v_human = {k: v.set_index('ind30k', drop=True) for k, v in groups['gen'].groupby('labels')} 45 | machine_vs_human_joined = machine_v_human['machine'].join(machine_v_human['human'], rsuffix='_humanpair') 46 | machine_vs_human_joined['is_right'] = machine_vs_human_joined['machine'] > machine_vs_human_joined['machine_humanpair'] 47 | 48 | combined_scores = pd.concat(( 49 | unpaired_human[['machine', 'human', 'labels']], 50 | machine_vs_human_joined[['machine', 'human', 'labels']], 51 | ),0) 52 | combined_acc = np.mean(combined_scores[['machine', 'human']].idxmax(1) == combined_scores['labels']) 53 | 54 | stats = { 55 | 'paired_acc': np.mean(machine_vs_human_joined['is_right']), 56 | 'unpaired_acc': combined_acc, 57 | } 58 | return stats 59 | 60 | # Compute the validation stats 61 | val_stats = score(probs, set_to_info['test']) 62 | print(val_stats) -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Grover 2 | ##### UPDATE, Sept 17 2019. We got into NeurIPS (camera ready coming soon!) and we've made Grover-Mega publicly available without you needing to fill out the form. You can download it using [download_model.py](download_model.py). 3 | 4 | (aka, code for [Defending Against Neural Fake News](https://arxiv.org/abs/1905.12616)) 5 | 6 | Grover is a model for Neural Fake News -- both generation and detection. However, it probably can also be used for other generation tasks. 7 | 8 | Visit our project page at [rowanzellers.com/grover](https://rowanzellers.com/grover), [the AI2 online demo](https://grover.allenai.org), or read the full paper at [arxiv.org/abs/1905.12616](https://arxiv.org/abs/1905.12616). 9 | 10 | ![teaser](https://i.imgur.com/VAGFpBe.png "teaser") 11 | 12 | ## What's in this repo? 13 | 14 | We are releasing the following: 15 | * Code for the Grover generator (in [lm/](lm/)). This involves training the model as a language model across fields. 16 | * Code for the Grover discriminator in [discrimination/](discrimination/). Without much changing, you can run Grover as a discriminator to detect Neural Fake News. 17 | * Code for generating from a Grover model, in [sample/](sample/). 18 | * Code for making your own RealNews dataset in [realnews/](realnews/). 19 | * Model checkpoints freely available online for *all* of the Grover models. For using the RealNews dataset for research, please [submit this form](https://docs.google.com/forms/d/1LMAUeUtHNPXO9koyAIlDpvyKsLSYlrBj3rYhC30a7Ak) and message me on [contact me on Twitter](https://twitter.com/rown) or [through email](https://scr.im/rowan). You will need to use a valid account that has google cloud enabled, otherwise, I won't be able to give you access 😢 20 | 21 | Scroll down 👇 for some easy-to-use instructions for setting up Grover to generate news articles. 22 | 23 | ## Setting up your environment 24 | 25 | *NOTE*: If you just care about making your own RealNews dataset, you will need to set up your environment separately just for that, using an AWS machine (see [realnews/](realnews/).) 26 | 27 | There are a few ways you can run Grover: 28 | * **Generation mode (inference)**. This requires a GPU because I wasn't able to get top-p sampling, or caching of transformer hidden states, to work on a TPU. 29 | * **LM Validation mode (perplexity)**. 
This could be run on a GPU or a TPU, but I've only tested this with TPU inference. 30 | * **LM Training mode**. This requires a large TPU pod. 31 | * **Discrimination mode (training)**. This requires a TPU pod. 32 | * **Discrimination mode (inference)**. This could be run on a GPU or a TPU, but I've only tested this with TPU inference. 33 | 34 | **NOTE**: You might be able to get things to work using different hardware. However, it might be a lot of work engineering wise and I don't recommend it if possible. Please don't contact me with requests like this, as there's not much help I can give you. 35 | 36 | I used Python3.6 for everything. Usually I set it up using the following commands: 37 | ``` 38 | curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh && \ 39 | chmod +x ~/miniconda.sh && \ 40 | ~/miniconda.sh -b -p ~/conda && \ 41 | rm ~/miniconda.sh && \ 42 | ~/conda/bin/conda install -y python=3.6 43 | ``` 44 | Then `pip install -r requirements-gpu.txt` if you're installing on a GPU, or `pip install requirements-tpu.txt` for TPU. 45 | 46 | Misc notes/tips: 47 | * If you have a lot of projects on your machine, you might want to use an anaconda environment to handle them all. Use `conda create -n grover python=3.6` to create an environment named `grover`. To enter the environment use `source activate grover`. To leave use `source deactivate`. 48 | * I'm using tensorflow `1.13.1` which requires Cuda `10.0`. You'll need to install that from the nvidia website. I usually install it into `/usr/local/cuda-10.0/`, so you will need to run `export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64` so tensorflow knows where to find it. 49 | * I always have my pythonpath as the root directory. While in the `grover` directory, run `export PYTHONPATH=$(pwd)` to set it. 50 | 51 | ## Quickstart: setting up Grover for generation! 52 | 53 | 1. Set up your environment. Here's the easy way, assuming anaconda is installed: `conda create -y -n grover python=3.6 && source activate grover && pip install -r requirements-gpu.txt` 54 | 2. Download the model using `python download_model.py base` 55 | 3. Now generate: `PYTHONPATH=$(pwd) python sample/contextual_generate.py -model_config_fn lm/configs/base.json -model_ckpt models/base/model.ckpt -metadata_fn sample/april2019_set_mini.jsonl -out_fn april2019_set_mini_out.jsonl` 56 | 57 | Congrats! You can view the generations, conditioned on the domain/headline/date/authors, in `april2019_set_mini_out.jsonl`. 58 | 59 | ## FAQ: What's the deal with the release of Grover? 60 | 61 | Our core position is that [it is important to release possibly-dangerous models to researchers](https://thegradient.pub/why-we-released-grover/). At the same time, we believe Grover-Mega isn't particularly useful to anyone who isn't doing research in this area, particularly as [we have an online web demo available](https://grover.allenai.org/) and the model is computationally expensive. We previously were a bit stricter and limited initial use of Grover-Mega to researchers. Now that several months have passed since we put the paper on arxiv, and since several other large-scale language models have been publicly released, we figured that there is little harm in fully releasing Grover-Mega. 
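Back to the quickstart above: once `april2019_set_mini_out.jsonl` has been written, you can skim the generations with a few lines of Python. Treat this as a sketch; the `gens_article` key comes from `sample/contextual_generate.py`, which stores generations under `gens_<target>` with `article` as the default target.

```
import json

with open("april2019_set_mini_out.jsonl", "r") as f:
    for line in f:
        item = json.loads(line)
        for gen in item.get("gens_article", []):
            # Print the conditioning title next to the start of each generation.
            print(item.get("title", "(no title)"), "=>", gen[:200])
```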
62 | 63 | ### Bibtex 64 | 65 | ``` 66 | @inproceedings{zellers2019grover, 67 | title={Defending Against Neural Fake News}, 68 | author={Zellers, Rowan and Holtzman, Ari and Rashkin, Hannah and Bisk, Yonatan and Farhadi, Ali and Roesner, Franziska and Choi, Yejin}, 69 | booktitle={Advances in Neural Information Processing Systems 32}, 70 | year={2019} 71 | } 72 | ``` 73 | -------------------------------------------------------------------------------- /lm/train.py: -------------------------------------------------------------------------------- 1 | # Original work Copyright 2018 The Google AI Language Team Authors. 2 | # Modified work Copyright 2019 Rowan Zellers 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | """ Training script! """ 17 | 18 | import tensorflow as tf 19 | 20 | from lm.dataloader import input_fn_builder 21 | from lm.modeling import model_fn_builder, GroverConfig 22 | 23 | flags = tf.flags 24 | 25 | FLAGS = flags.FLAGS 26 | 27 | ## Required parameters 28 | flags.DEFINE_string( 29 | "config_file", 'configs/base.json', 30 | "The config json file corresponding to the pre-trained news model. " 31 | "This specifies the model architecture.") 32 | 33 | flags.DEFINE_string( 34 | "input_file", None, 35 | "Input TF example files (can be a glob or comma separated).") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | ## Other parameters 42 | flags.DEFINE_string( 43 | "init_checkpoint", None, 44 | "Initial checkpoint (usually from a pre-trained model).") 45 | 46 | flags.DEFINE_integer( 47 | "max_seq_length", 1024, 48 | "The maximum total input sequence length after BPE tokenization. " 49 | "Sequences longer than this will be truncated, and sequences shorter " 50 | "than this will be padded. Must match data generation.") 51 | 52 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 53 | 54 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for adafactor.") 55 | 56 | flags.DEFINE_integer("num_train_steps", 100000, "Number of training steps.") 57 | 58 | flags.DEFINE_integer("num_warmup_steps", 10000, "Number of warmup steps.") 59 | 60 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 61 | "How often to save the model checkpoint.") 62 | 63 | flags.DEFINE_integer("iterations_per_loop", 1000, 64 | "How many steps to make in each estimator call.") 65 | 66 | flags.DEFINE_integer("max_eval_steps", 100, "Maximum number of eval steps.") 67 | 68 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 69 | 70 | flags.DEFINE_string( 71 | "tpu_name", None, 72 | "The Cloud TPU to use for training. This should be either the name " 73 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 74 | "url.") 75 | 76 | flags.DEFINE_string( 77 | "tpu_zone", None, 78 | "[Optional] GCE zone where the Cloud TPU is located in. 
If not " 79 | "specified, we will attempt to automatically detect the GCE project from " 80 | "metadata.") 81 | 82 | flags.DEFINE_string( 83 | "gcp_project", None, 84 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 85 | "specified, we will attempt to automatically detect the GCE project from " 86 | "metadata.") 87 | 88 | flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 89 | 90 | flags.DEFINE_integer( 91 | "num_tpu_cores", 8, 92 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 93 | 94 | 95 | def main(_): 96 | tf.logging.set_verbosity(tf.logging.INFO) 97 | 98 | news_config = GroverConfig.from_json_file(FLAGS.config_file) 99 | 100 | tf.gfile.MakeDirs(FLAGS.output_dir) 101 | 102 | input_files = [] 103 | for input_pattern in FLAGS.input_file.split(","): 104 | input_files.extend(tf.gfile.Glob(input_pattern)) 105 | 106 | tf.logging.info("*** Input Files ***") 107 | for input_file in input_files: 108 | tf.logging.info(" %s" % input_file) 109 | 110 | tpu_cluster_resolver = None 111 | if FLAGS.use_tpu and FLAGS.tpu_name: 112 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 113 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 114 | 115 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 116 | run_config = tf.contrib.tpu.RunConfig( 117 | cluster=tpu_cluster_resolver, 118 | master=FLAGS.master, 119 | model_dir=FLAGS.output_dir, 120 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 121 | keep_checkpoint_max=None, 122 | tpu_config=tf.contrib.tpu.TPUConfig( 123 | iterations_per_loop=FLAGS.iterations_per_loop, 124 | num_shards=FLAGS.num_tpu_cores, 125 | per_host_input_for_training=is_per_host)) 126 | 127 | model_fn = model_fn_builder(news_config, init_checkpoint=FLAGS.init_checkpoint, 128 | learning_rate=FLAGS.learning_rate, 129 | num_train_steps=FLAGS.num_train_steps, 130 | num_warmup_steps=FLAGS.num_warmup_steps, 131 | use_tpu=FLAGS.use_tpu, 132 | ) 133 | 134 | # If TPU is not available, this will fall back to normal Estimator on CPU 135 | # or GPU. 
136 | estimator = tf.contrib.tpu.TPUEstimator( 137 | use_tpu=FLAGS.use_tpu, 138 | model_fn=model_fn, 139 | config=run_config, 140 | train_batch_size=FLAGS.train_batch_size, 141 | eval_batch_size=FLAGS.train_batch_size, 142 | params={'model_dir': FLAGS.output_dir} 143 | ) 144 | 145 | tf.logging.info("***** Running training *****") 146 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 147 | train_input_fn = input_fn_builder( 148 | input_files=input_files, 149 | seq_length=FLAGS.max_seq_length, 150 | is_training=True) 151 | 152 | estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps) 153 | 154 | if __name__ == "__main__": 155 | flags.mark_flag_as_required("input_file") 156 | flags.mark_flag_as_required("output_dir") 157 | tf.app.run() 158 | -------------------------------------------------------------------------------- /sample/contextual_generate.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import sys 4 | import json 5 | 6 | sys.path.append('../') 7 | from lm.modeling import GroverModel, GroverConfig, _top_p_sample, sample 8 | from sample.encoder import get_encoder, format_context, _tokenize_article_pieces, extract_generated_target 9 | from tqdm import tqdm 10 | 11 | import argparse 12 | 13 | parser = argparse.ArgumentParser(description='Contextual generation (aka given some metadata we will generate articles') 14 | parser.add_argument( 15 | '-metadata_fn', 16 | dest='metadata_fn', 17 | type=str, 18 | help='Path to a JSONL containing metadata', 19 | ) 20 | parser.add_argument( 21 | '-out_fn', 22 | dest='out_fn', 23 | type=str, 24 | help='Out jsonl, which will contain the completed jsons', 25 | ) 26 | parser.add_argument( 27 | '-model_config_fn', 28 | dest='model_config_fn', 29 | default='../lm/configs/base.json', 30 | type=str, 31 | help='Configuration JSON for the model', 32 | ) 33 | parser.add_argument( 34 | '-model_ckpt', 35 | dest='model_ckpt', 36 | default='../models/base/model.ckpt', 37 | type=str, 38 | help='checkpoint file for the model', 39 | ) 40 | parser.add_argument( 41 | '-target', 42 | dest='target', 43 | default='article', 44 | type=str, 45 | help='What to generate for each item in metadata_fn. can be article (body), title, etc.', 46 | ) 47 | parser.add_argument( 48 | '-batch_size', 49 | dest='batch_size', 50 | default=1, 51 | type=int, 52 | help='How many things to generate per context. will split into chunks if need be', 53 | ) 54 | parser.add_argument( 55 | '-num_folds', 56 | dest='num_folds', 57 | default=1, 58 | type=int, 59 | help='Number of folds. useful if we want to split up a big file into multiple jobs.', 60 | ) 61 | parser.add_argument( 62 | '-fold', 63 | dest='fold', 64 | default=0, 65 | type=int, 66 | help='which fold we are on. useful if we want to split up a big file into multiple jobs.' 67 | ) 68 | parser.add_argument( 69 | '-max_batch_size', 70 | dest='max_batch_size', 71 | default=None, 72 | type=int, 73 | help='max batch size. You can leave this out and we will infer one based on the number of hidden layers', 74 | ) 75 | parser.add_argument( 76 | '-top_p', 77 | dest='top_p', 78 | default=0.95, 79 | type=float, 80 | help='p to use for top p sampling. 
if this isn\'t none, use this for everthing' 81 | ) 82 | 83 | args = parser.parse_args() 84 | 85 | encoder = get_encoder() 86 | news_config = GroverConfig.from_json_file(args.model_config_fn) 87 | 88 | # We might have to split the batch into multiple chunks if the batch size is too large 89 | default_mbs = {12: 32, 24: 16, 48: 3} 90 | max_batch_size = args.max_batch_size if args.max_batch_size is not None else default_mbs[news_config.num_hidden_layers] 91 | 92 | # factorize args.batch_size = (num_chunks * batch_size_per_chunk) s.t. batch_size_per_chunk < max_batch_size 93 | num_chunks = int(np.ceil(args.batch_size / max_batch_size)) 94 | batch_size_per_chunk = int(np.ceil(args.batch_size / num_chunks)) 95 | print("\n~~\nbatch size={}, max batch size={}, num chunks={}, batch size per chunk={}\n~~\n".format( 96 | args.batch_size, max_batch_size, num_chunks, batch_size_per_chunk), flush=True) 97 | 98 | # This controls the top p for each generation. 99 | top_p = np.ones((num_chunks, batch_size_per_chunk), dtype=np.float32) * args.top_p 100 | 101 | with open(args.metadata_fn, 'r') as f: 102 | articles = [json.loads(l) for i, l in enumerate(f) if i % args.num_folds == args.fold] 103 | 104 | tf_config = tf.ConfigProto(allow_soft_placement=True) 105 | 106 | with tf.Session(config=tf_config, graph=tf.Graph()) as sess, \ 107 | open(args.out_fn, 'w') as f_out: 108 | initial_context = tf.placeholder(tf.int32, [batch_size_per_chunk, None]) 109 | p_for_topp = tf.placeholder(tf.float32, [batch_size_per_chunk]) 110 | eos_token = tf.placeholder(tf.int32, []) 111 | ignore_ids = tf.placeholder(tf.bool, [news_config.vocab_size]) 112 | tokens, probs = sample(news_config=news_config, initial_context=initial_context, 113 | eos_token=eos_token, ignore_ids=ignore_ids, p_for_topp=p_for_topp, 114 | do_topk=False) 115 | 116 | saver = tf.train.Saver() 117 | saver.restore(sess, args.model_ckpt) 118 | 119 | # Let's go! 120 | for i, article in enumerate(tqdm(articles)): 121 | article_pieces = _tokenize_article_pieces(encoder, article) 122 | context_formatted = [] 123 | for key in ['domain', 'date', 'authors', 'title', 'article']: 124 | if key != args.target: 125 | context_formatted.extend(article_pieces.pop(key, [])) 126 | 127 | if len(context_formatted) >= 1020: 128 | print( 129 | "WARNING: the provided context is {} tokens, but the maximum length Grover was trained on was 1024 tokens.".format( 130 | len(context_formatted)), flush=True) 131 | context_formatted = context_formatted[:1020] 132 | 133 | context_formatted.append(encoder.__dict__['begin_{}'.format(args.target)]) 134 | # Format context end 135 | 136 | # Indices we definitely DONT WANT TO PREDICT 137 | ignore_ids_np = np.array(encoder.special_tokens_onehot) 138 | ignore_ids_np[encoder.__dict__['end_{}'.format(args.target)]] = 0 139 | 140 | gens = [] 141 | gens_raw = [] 142 | gen_probs = [] 143 | 144 | article['top_ps'] = top_p.reshape(-1).tolist() 145 | for chunk_i in range(num_chunks): 146 | tokens_out, probs_out = sess.run([tokens, probs], 147 | feed_dict={initial_context: [context_formatted] * batch_size_per_chunk, 148 | eos_token: encoder.__dict__['end_{}'.format(args.target)], 149 | ignore_ids: ignore_ids_np, 150 | p_for_topp: top_p[chunk_i]}) 151 | 152 | for t_i, p_i in zip(tokens_out, probs_out): 153 | extraction = extract_generated_target(output_tokens=t_i, encoder=encoder, target=args.target) 154 | gens.append(extraction['extraction']) 155 | 156 | # NOTE: Originally I didn't add the +1 which meant that end article was being cut off. whoops. 
157 | # better add that! 158 | gens_raw.append(t_i[extraction['start_ind']:extraction['end_ind'] + 1].tolist()) 159 | 160 | assert extraction['start_ind'] == len(context_formatted) 161 | gen_probs.append(p_i[:extraction['end_ind'] - len(context_formatted) + 1].tolist()) 162 | 163 | article['gens_{}'.format(args.target)] = gens 164 | article['gensraw_{}'.format(args.target)] = gens_raw 165 | article['probs_{}'.format(args.target)] = gen_probs 166 | 167 | # these were in there for whatever reason... 168 | article.pop('input_ids_conditional', None) 169 | article.pop('input_ids_unconditional', None) 170 | f_out.write(json.dumps(article) + '\n') 171 | print("Written {}/{} articles".format(i, len(articles)), flush=True) 172 | -------------------------------------------------------------------------------- /lm/dataloader.py: -------------------------------------------------------------------------------- 1 | # Original work Copyright 2018 The Google AI Language Team Authors. 2 | # Modified work Copyright 2019 Rowan Zellers 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | import collections 17 | import tensorflow as tf 18 | 19 | 20 | def _decode_record(record, name_to_features): 21 | """Decodes a record to a TensorFlow example.""" 22 | example = tf.parse_single_example(record, name_to_features) 23 | 24 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 25 | # So cast all int64 to int32. 26 | for name in list(example.keys()): 27 | t = example[name] 28 | if t.dtype == tf.int64: 29 | t = tf.cast(t, tf.int32) 30 | example[name] = t 31 | return example 32 | 33 | 34 | def input_fn_builder(input_files, 35 | seq_length, 36 | is_training, 37 | num_cpu_threads=4, 38 | evaluate_for_fixed_number_of_steps=True): 39 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 40 | 41 | def input_fn(params): 42 | """The actual input function.""" 43 | batch_size = params["batch_size"] 44 | name_to_features = { 45 | "input_ids": tf.FixedLenFeature([seq_length + 1], tf.int64), 46 | } 47 | 48 | # For training, we want a lot of parallel reading and shuffling. 49 | # For eval, we want no shuffling and parallel reading doesn't matter. 50 | if is_training: 51 | d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files)) 52 | d = d.repeat() 53 | d = d.shuffle(buffer_size=len(input_files)) 54 | 55 | # `cycle_length` is the number of parallel files that get read. 56 | cycle_length = min(num_cpu_threads, len(input_files)) 57 | 58 | # `sloppy` mode means that the interleaving is not exact. This adds 59 | # even more randomness to the training pipeline. 60 | d = d.apply( 61 | tf.data.experimental.parallel_interleave( 62 | tf.data.TFRecordDataset, 63 | sloppy=is_training, 64 | cycle_length=cycle_length)) 65 | d = d.shuffle(buffer_size=100) 66 | else: 67 | d = tf.data.TFRecordDataset(input_files) 68 | # If we evaluate for a fixed number of steps we don't want to encounter 69 | # out-of-range exceptions. 
70 | if evaluate_for_fixed_number_of_steps: 71 | d = d.repeat() 72 | 73 | # We must `drop_remainder` on training because the TPU requires fixed 74 | # size dimensions. For eval, we assume we are evaluating on the CPU or GPU 75 | # and we *don't* want to drop the remainder, otherwise we wont cover 76 | # every sample. 77 | d = d.apply( 78 | tf.data.experimental.map_and_batch( 79 | lambda record: _decode_record(record, name_to_features), 80 | batch_size=batch_size, 81 | num_parallel_batches=num_cpu_threads, 82 | drop_remainder=True)) 83 | return d 84 | 85 | return input_fn 86 | 87 | 88 | # ~~~~~~~~~~~~~~ This is for classification / AF ~~~~~~~~~~~~~~~~~~ 89 | def classification_convert_examples_to_features( 90 | examples, max_seq_length, batch_size, encoder, output_file, labels, pad_extra_examples=False, 91 | chop_from_front_if_needed=True): 92 | """Convert a set of `InputExample`s to a TFRecord file.""" 93 | 94 | writer = tf.python_io.TFRecordWriter(output_file) 95 | 96 | label_map = {label: i for i, label in enumerate(labels)} 97 | 98 | for (ex_index, example) in enumerate(examples): 99 | if ex_index % 10000 == 0: 100 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 101 | 102 | # begin_summary is our [CLS] token 103 | tokens = example['ids'] + [encoder.begin_summary] 104 | 105 | if len(tokens) > max_seq_length: 106 | if chop_from_front_if_needed: 107 | tokens = tokens[-max_seq_length:] 108 | else: 109 | tokens = example['ids'][:(max_seq_length-1)] + [encoder.begin_summary] 110 | elif len(tokens) < max_seq_length: 111 | tokens.extend([encoder.padding] * (max_seq_length - len(tokens))) 112 | 113 | features = collections.OrderedDict() 114 | features['input_ids'] = tf.train.Feature(int64_list=tf.train.Int64List(value=tokens)) 115 | features['label_ids'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[label_map[example['label']]])) 116 | features['is_real_example'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[1])) 117 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 118 | writer.write(tf_example.SerializeToString()) 119 | 120 | if pad_extra_examples: 121 | for x in range(len(examples) % batch_size): 122 | features = collections.OrderedDict() 123 | features['input_ids'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[0]*max_seq_length)) 124 | features['label_ids'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[0])) 125 | features['is_real_example'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[0])) 126 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 127 | writer.write(tf_example.SerializeToString()) 128 | writer.close() 129 | 130 | 131 | def classification_input_fn_builder(input_file, seq_length, is_training, 132 | drop_remainder, 133 | buffer_size=100): 134 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 135 | 136 | name_to_features = { 137 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 138 | "label_ids": tf.FixedLenFeature([], tf.int64), 139 | "is_real_example": tf.FixedLenFeature([], tf.int64), 140 | } 141 | 142 | def input_fn(params): 143 | """The actual input function.""" 144 | batch_size = params["batch_size"] 145 | 146 | # For training, we want a lot of parallel reading and shuffling. 147 | # For eval, we want no shuffling and parallel reading doesn't matter. 
148 | d = tf.data.TFRecordDataset(input_file) 149 | if is_training: 150 | d = d.repeat() 151 | d = d.shuffle(buffer_size=buffer_size) 152 | 153 | d = d.apply( 154 | tf.data.experimental.map_and_batch( 155 | lambda record: _decode_record(record, name_to_features), 156 | batch_size=batch_size, 157 | drop_remainder=drop_remainder)) 158 | 159 | return d 160 | 161 | return input_fn 162 | -------------------------------------------------------------------------------- /lm/validate.py: -------------------------------------------------------------------------------- 1 | # Original work Copyright 2018 The Google AI Language Team Authors. 2 | # Modified work Copyright 2019 Rowan Zellers 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | import os 17 | from lm.modeling import model_fn_builder, GroverConfig 18 | import tensorflow as tf 19 | from lm.dataloader import input_fn_builder 20 | import numpy as np 21 | import tempfile 22 | import h5py 23 | from google.cloud import storage 24 | 25 | flags = tf.flags 26 | 27 | FLAGS = flags.FLAGS 28 | 29 | ## Required parameters 30 | flags.DEFINE_string( 31 | "config_file", 'configs/base.json', 32 | "The config json file corresponding to the pre-trained news model. " 33 | "This specifies the model architecture.") 34 | 35 | flags.DEFINE_string( 36 | "input_file", None, 37 | "Input TF example files (can be a glob or comma separated).") 38 | 39 | flags.DEFINE_string( 40 | "output_dir", None, 41 | "The output directory where the model checkpoints will be written.") 42 | 43 | flags.DEFINE_string( 44 | "validation_name", 'preds.h5', 45 | "Name to use") 46 | 47 | ## Other parameters 48 | flags.DEFINE_string( 49 | "init_checkpoint", None, 50 | "Initial checkpoint (usually from a pre-trained model).") 51 | 52 | flags.DEFINE_integer( 53 | "max_seq_length", 1024, 54 | "The maximum total input sequence length after WordPiece tokenization. " 55 | "Sequences longer than this will be truncated, and sequences shorter " 56 | "than this will be padded. Must match data generation.") 57 | 58 | flags.DEFINE_integer("iterations_per_loop", 1000, 59 | "How many steps to make in each estimator call.") 60 | 61 | flags.DEFINE_integer("batch_size", 32, "Batch size used for eval") 62 | 63 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 64 | 65 | flags.DEFINE_string( 66 | "tpu_name", None, 67 | "The Cloud TPU to use for training. This should be either the name " 68 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 69 | "url.") 70 | 71 | flags.DEFINE_string( 72 | "tpu_zone", None, 73 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 74 | "specified, we will attempt to automatically detect the GCE project from " 75 | "metadata.") 76 | 77 | flags.DEFINE_string( 78 | "gcp_project", None, 79 | "[Optional] Project name for the Cloud TPU-enabled project. 
If not " 80 | "specified, we will attempt to automatically detect the GCE project from " 81 | "metadata.") 82 | 83 | flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 84 | 85 | flags.DEFINE_integer( 86 | "num_tpu_cores", 8, 87 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 88 | 89 | 90 | # This is a handy little utility so that we can save the perplexities to TPU 91 | class gcloudwriter(): 92 | def __init__(self, gcloud_name): 93 | assert gcloud_name.startswith('gs://') 94 | self.gcloud_name = gcloud_name 95 | bucket_name, blob_name = gcloud_name.split('gs://')[1].split('/', 1) 96 | bucket = storage.Client().get_bucket(bucket_name) 97 | self.blob = bucket.blob(blob_name) 98 | 99 | def __enter__(self): 100 | self.tempfile = tempfile.NamedTemporaryFile() 101 | return self.tempfile 102 | 103 | def __exit__(self, *args): 104 | self.tempfile.flush() 105 | print("UPLOADING TO {}".format(self.gcloud_name), flush=True) 106 | self.blob.upload_from_filename(self.tempfile.name) 107 | self.tempfile.close() 108 | 109 | 110 | def ind_where(array: np.ndarray, target, return_first_match=True, default_value=-1): 111 | """ 112 | :param array: Single dimension array 113 | :param target: target to search for 114 | :param return_first_match: If true, return the first index that matches, otherwise, return the last one 115 | :param default_value: Index to return if there was no match 116 | :return: index of the first match, or -1 if nothing 117 | """ 118 | assert array.ndim == 1 119 | matching_inds = np.where(array == target)[0] 120 | if len(matching_inds) > 0: 121 | if return_first_match: 122 | return int(matching_inds[0]) 123 | else: 124 | return int(matching_inds[-1]) 125 | return default_value 126 | 127 | 128 | def main(_): 129 | tf.logging.set_verbosity(tf.logging.INFO) 130 | 131 | news_config = GroverConfig.from_json_file(FLAGS.config_file) 132 | 133 | tf.gfile.MakeDirs(FLAGS.output_dir) 134 | 135 | input_files = [] 136 | for input_pattern in FLAGS.input_file.split(","): 137 | input_files.extend(tf.gfile.Glob(input_pattern)) 138 | 139 | tf.logging.info("*** Input Files ***") 140 | for input_file in input_files: 141 | tf.logging.info(" %s" % input_file) 142 | 143 | tpu_cluster_resolver = None 144 | if FLAGS.use_tpu and FLAGS.tpu_name: 145 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 146 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 147 | 148 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 149 | run_config = tf.contrib.tpu.RunConfig( 150 | cluster=tpu_cluster_resolver, 151 | master=FLAGS.master, 152 | model_dir=FLAGS.output_dir, 153 | save_checkpoints_steps=FLAGS.iterations_per_loop, 154 | keep_checkpoint_max=None, 155 | tpu_config=tf.contrib.tpu.TPUConfig( 156 | iterations_per_loop=FLAGS.iterations_per_loop, 157 | num_shards=FLAGS.num_tpu_cores, 158 | per_host_input_for_training=is_per_host)) 159 | 160 | model_fn = model_fn_builder(news_config, 161 | init_checkpoint=FLAGS.init_checkpoint, 162 | learning_rate=1e-4, 163 | num_train_steps=0, 164 | num_warmup_steps=0, 165 | use_tpu=FLAGS.use_tpu, 166 | ) 167 | 168 | # If TPU is not available, this will fall back to normal Estimator on CPU 169 | # or GPU. 
170 | estimator = tf.contrib.tpu.TPUEstimator( 171 | use_tpu=FLAGS.use_tpu, 172 | model_fn=model_fn, 173 | config=run_config, 174 | train_batch_size=FLAGS.batch_size, 175 | eval_batch_size=FLAGS.batch_size, 176 | predict_batch_size=FLAGS.batch_size, 177 | params={'model_dir': FLAGS.output_dir} 178 | ) 179 | 180 | eval_input_fn = input_fn_builder( 181 | input_files=input_files, 182 | seq_length=FLAGS.max_seq_length, 183 | evaluate_for_fixed_number_of_steps=False, 184 | num_cpu_threads=1, 185 | is_training=False) 186 | result = [x for x in estimator.predict(input_fn=eval_input_fn, yield_single_examples=True)] 187 | cats = sorted(result[0].keys()) 188 | result_stack = {cat: np.stack([x[cat] for x in result]) for cat in cats} 189 | 190 | with gcloudwriter(os.path.join(FLAGS.output_dir, FLAGS.validation_name)) as tempfile_name: 191 | with h5py.File(tempfile_name, 'w') as h5: 192 | for cat, data in result_stack.items(): 193 | dtype2use = np.float16 if cat.endswith(('logprobs', 'top_p_required')) else np.uint16 194 | h5.create_dataset(cat, data=data.astype(dtype2use)) 195 | h5.create_dataset('model', data=FLAGS.config_file) 196 | h5.create_dataset('ckpt', data=FLAGS.init_checkpoint) 197 | h5.create_dataset('input_file', data=FLAGS.input_file) 198 | 199 | # This gives the perplexity of the entire article. if you want to replicate the results of the paper you 200 | # might need to do something different to extract the ppl of just the body in particular. 201 | ppl_ex = [] 202 | for logprobs_i, ids_i in zip(result_stack['gt_logprobs'], result_stack['labels']): 203 | # Omit the first token. Keep in mind input_ids is shifted by 1 204 | start_ind = ind_where(ids_i, target=50265, default_value=0) 205 | end_ind = ind_where(ids_i, target=50266, default_value=ids_i.shape[0] - 1) 206 | ppl_ex.append(logprobs_i[start_ind:end_ind]) 207 | ppl_ex = np.concatenate(ppl_ex, 0) 208 | print("Article perplexity is {:.3f}".format(np.exp(-np.mean(ppl_ex))), flush=True) 209 | 210 | 211 | if __name__ == "__main__": 212 | flags.mark_flag_as_required("input_file") 213 | flags.mark_flag_as_required("output_dir") 214 | tf.app.run() 215 | -------------------------------------------------------------------------------- /lm/utils.py: -------------------------------------------------------------------------------- 1 | # Original work Copyright 2018 The Google AI Language Team Authors. 2 | # Modified work Copyright 2019 Rowan Zellers 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | import collections 17 | import re 18 | 19 | import six 20 | import tensorflow as tf 21 | import numpy as np 22 | from tensorflow.python.lib.io import file_io 23 | 24 | 25 | def _save_np(absolute_fn, array): 26 | if absolute_fn.startswith('gs://'): 27 | with file_io.FileIO(absolute_fn, 'w') as f: 28 | np.save(f, array) 29 | else: 30 | np.save(absolute_fn, array) 31 | 32 | 33 | def assert_rank(tensor, expected_rank, name=None): 34 | """Raises an exception if the tensor rank is not of the expected rank. 
35 | 36 | Args: 37 | tensor: A tf.Tensor to check the rank of. 38 | expected_rank: Python integer or list of integers, expected rank. 39 | name: Optional name of the tensor for the error message. 40 | 41 | Raises: 42 | ValueError: If the expected shape doesn't match the actual shape. 43 | """ 44 | if name is None: 45 | name = tensor.name 46 | 47 | expected_rank_dict = {} 48 | if isinstance(expected_rank, six.integer_types): 49 | expected_rank_dict[expected_rank] = True 50 | else: 51 | for x in expected_rank: 52 | expected_rank_dict[x] = True 53 | 54 | actual_rank = tensor.shape.ndims 55 | if actual_rank not in expected_rank_dict: 56 | scope_name = tf.get_variable_scope().name 57 | raise ValueError( 58 | "For the tensor `%s` in scope `%s`, the actual rank " 59 | "`%d` (shape = %s) is not equal to the expected rank `%s`" % 60 | (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank))) 61 | 62 | 63 | def get_shape_list(tensor, expected_rank=None, name=None): 64 | """Returns a list of the shape of tensor, preferring static dimensions. 65 | 66 | Args: 67 | tensor: A tf.Tensor object to find the shape of. 68 | expected_rank: (optional) int. The expected rank of `tensor`. If this is 69 | specified and the `tensor` has a different rank, and exception will be 70 | thrown. 71 | name: Optional name of the tensor for the error message. 72 | 73 | Returns: 74 | A list of dimensions of the shape of tensor. All static dimensions will 75 | be returned as python integers, and dynamic dimensions will be returned 76 | as tf.Tensor scalars. 77 | """ 78 | if name is None: 79 | name = tensor.name 80 | 81 | if expected_rank is not None: 82 | assert_rank(tensor, expected_rank, name) 83 | 84 | shape = tensor.shape.as_list() 85 | 86 | non_static_indexes = [] 87 | for (index, dim) in enumerate(shape): 88 | if dim is None: 89 | non_static_indexes.append(index) 90 | 91 | if not non_static_indexes: 92 | return shape 93 | 94 | dyn_shape = tf.shape(tensor) 95 | for index in non_static_indexes: 96 | shape[index] = dyn_shape[index] 97 | return shape 98 | 99 | 100 | def gelu(input_tensor): 101 | """Gaussian Error Linear Unit. 102 | 103 | This is a smoother version of the RELU. 104 | Original paper: https://arxiv.org/abs/1606.08415 105 | 106 | Args: 107 | input_tensor: float Tensor to perform activation. 108 | 109 | Returns: 110 | `input_tensor` with the GELU activation applied. 111 | """ 112 | cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0))) 113 | return input_tensor * cdf 114 | 115 | 116 | def layer_norm(input_tensor, name=None, epsilon=1e-5): 117 | """Run layer normalization on the last dimension of the tensor.""" 118 | name2use = f'LayerNorm_{name}' if name is not None else name 119 | with tf.variable_scope(name2use, default_name='LayerNorm'): 120 | dim = input_tensor.shape[-1].value 121 | gamma = tf.get_variable('gamma', [dim], initializer=tf.constant_initializer(1)) 122 | beta = tf.get_variable('beta', [dim], initializer=tf.constant_initializer(0)) 123 | mean = tf.reduce_mean(input_tensor, axis=-1, keepdims=True) 124 | std = tf.reduce_mean(tf.square(input_tensor - mean), axis=-1, keepdims=True) 125 | input_tensor = (input_tensor - mean) * tf.rsqrt(std + epsilon) 126 | input_tensor = input_tensor * gamma + beta 127 | return input_tensor 128 | 129 | 130 | def dropout(input_tensor, dropout_prob): 131 | """Perform dropout. 132 | 133 | Args: 134 | input_tensor: float Tensor. 135 | dropout_prob: Python float. 
The probability of dropping out a value (NOT of 136 | *keeping* a dimension as in `tf.nn.dropout`). 137 | 138 | Returns: 139 | A version of `input_tensor` with dropout applied. 140 | """ 141 | if dropout_prob is None or dropout_prob == 0.0: 142 | return input_tensor 143 | output = tf.nn.dropout(input_tensor, rate=dropout_prob) 144 | return output 145 | 146 | 147 | def get_attention_mask(nd, ns, *, dtype): 148 | """ 149 | this is a TPU compatible version of tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd) 150 | where the lower right triangle contains 1s 151 | """ 152 | i = tf.range(nd)[:, None] 153 | j = tf.range(ns) 154 | m = i >= j - ns + nd 155 | return tf.cast(m, dtype) 156 | 157 | 158 | def get_assignment_map_from_checkpoint(tvars, init_checkpoint): 159 | """Compute the union of the current variables and checkpoint variables.""" 160 | assignment_map = {} 161 | initialized_variable_names = {} 162 | 163 | name_to_variable = collections.OrderedDict() 164 | for var in tvars: 165 | name = var.name 166 | m = re.match("^(.*):\\d+$", name) 167 | if m is not None: 168 | name = m.group(1) 169 | name_to_variable[name] = var 170 | 171 | init_vars = tf.train.list_variables(init_checkpoint) 172 | 173 | assignment_map = collections.OrderedDict() 174 | for x in init_vars: 175 | (name, var) = (x[0], x[1]) 176 | if name not in name_to_variable: 177 | continue 178 | assignment_map[name] = name 179 | initialized_variable_names[name] = 1 180 | initialized_variable_names[name + ":0"] = 1 181 | return (assignment_map, initialized_variable_names) 182 | 183 | 184 | def construct_scalar_host_call(metric_dict, model_dir, prefix=""): 185 | """Construct a host call to log scalars when training on TPU. 186 | 187 | Args: 188 | metric_dict: A dict of the tensors to be logged. 189 | model_dir: The location to write the summary. 190 | prefix: The prefix (if any) to prepend to the metric names. 191 | 192 | Returns: 193 | A tuple of (function, args_to_be_passed_to_said_function) 194 | """ 195 | metric_names = list(metric_dict.keys()) 196 | 197 | def host_call_fn(global_step, *args): 198 | """Training host call. Creates scalar summaries for training metrics. 199 | 200 | This function is executed on the CPU and should not directly reference 201 | any Tensors in the rest of the `model_fn`. To pass Tensors from the 202 | model to the `metric_fn`, provide as part of the `host_call`. See 203 | https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/TPUEstimatorSpec 204 | for more information. 205 | 206 | Arguments should match the list of `Tensor` objects passed as the second 207 | element in the tuple passed to `host_call`. 208 | 209 | Args: 210 | global_step: `Tensor with shape `[batch]` for the global_step 211 | *args: Remaining tensors to log. 212 | 213 | Returns: 214 | List of summary ops to run on the CPU host. 215 | """ 216 | step = global_step[0] 217 | with tf.contrib.summary.create_file_writer( 218 | logdir=model_dir, filename_suffix=".host_call").as_default(): 219 | with tf.contrib.summary.always_record_summaries(): 220 | for i, name in enumerate(metric_names): 221 | tf.contrib.summary.scalar(prefix + name, args[i][0], step=step) 222 | 223 | return tf.contrib.summary.all_summary_ops() 224 | 225 | # To log the current learning rate, and gradient norm for Tensorboard, the 226 | # summary op needs to be run on the host CPU via host_call. host_call 227 | # expects [batch_size, ...] Tensors, thus reshape to introduce a batch 228 | # dimension. These Tensors are implicitly concatenated to 229 | # [params['batch_size']]. 
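# (Editor's sketch, not from the original code) The (function, tensors) pair
# returned below is intended to be passed to TPUEstimatorSpec as `host_call`
# inside a model_fn, e.g. roughly:
#
#     host_call = construct_scalar_host_call(
#         {'learning_rate': learning_rate, 'minibatch_loss': loss},
#         model_dir=params['model_dir'])
#     return tf.contrib.tpu.TPUEstimatorSpec(
#         mode=mode, loss=loss, train_op=train_op, host_call=host_call)
#
# The metric names and call site here are illustrative; the repo's actual
# model_fn may wire this up differently.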
230 | global_step_tensor = tf.reshape( 231 | tf.compat.v1.train.get_or_create_global_step(), [1]) 232 | other_tensors = [tf.reshape(metric_dict[key], [1]) for key in metric_names] 233 | 234 | return host_call_fn, [global_step_tensor] + other_tensors 235 | -------------------------------------------------------------------------------- /lm/optimization_adafactor.py: -------------------------------------------------------------------------------- 1 | # Original work Copyright 2018 The Google AI Language Team Authors. 2 | # Modified work Copyright 2019 Rowan Zellers 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | import re 16 | import tensorflow as tf 17 | from lm.utils import get_shape_list 18 | 19 | 20 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu): 21 | """Creates an optimizer training op.""" 22 | global_step = tf.train.get_or_create_global_step() 23 | 24 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32) 25 | 26 | # Implements linear decay of the learning rate. 27 | learning_rate = tf.train.polynomial_decay( 28 | learning_rate, 29 | global_step, 30 | num_train_steps, 31 | end_learning_rate=0.0, 32 | power=1.0, 33 | cycle=False) 34 | 35 | # Implements linear warmup. I.e., if global_step < num_warmup_steps, the 36 | # learning rate will be `global_step/num_warmup_steps * init_lr`. 37 | if num_warmup_steps: 38 | global_steps_int = tf.cast(global_step, tf.int32) 39 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32) 40 | 41 | global_steps_float = tf.cast(global_steps_int, tf.float32) 42 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32) 43 | 44 | warmup_percent_done = global_steps_float / warmup_steps_float 45 | warmup_learning_rate = init_lr * warmup_percent_done 46 | 47 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32) 48 | learning_rate = ( 49 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate) 50 | 51 | # It is recommended that you use this optimizer for fine tuning, since this 52 | # is how the model was trained (note that the Adam m/v variables are NOT 53 | # loaded from init_checkpoint.) 54 | optimizer = AdaFactorOptimizer( 55 | learning_rate=learning_rate, 56 | weight_decay_rate=0.01, 57 | beta_1=0.9, 58 | beta_2=0.999, 59 | epsilon=1e-6, 60 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"]) 61 | 62 | if use_tpu: 63 | optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer) 64 | 65 | tvars = tf.trainable_variables() 66 | grads = tf.gradients(loss, tvars) 67 | 68 | # You could do this, but instead we don't because a) it's slow and b) we already did the 'update clipping' 69 | # (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) 70 | 71 | train_op = optimizer.apply_gradients( 72 | zip(grads, tvars), global_step=global_step) 73 | 74 | # Normally the global step update is done inside of `apply_gradients`. 75 | # However, `AdaFactorOptimizer` doesn't do this. 
But if you use 76 | # a different optimizer, you should probably take this line out. 77 | new_global_step = global_step + 1 78 | train_op = tf.group(train_op, [global_step.assign(new_global_step)]) 79 | 80 | train_metrics = { 81 | 'learning_rate': learning_rate, 82 | 'minibatch_loss': loss, 83 | # 'minibatch_ppl': tf.math.exp(loss), 84 | } 85 | return train_op, train_metrics 86 | 87 | 88 | class AdaFactorOptimizer(tf.train.Optimizer): 89 | """here's the optimizer we'll use""" 90 | 91 | def __init__(self, 92 | learning_rate, 93 | weight_decay_rate=0.0, 94 | beta_1=0.9, 95 | beta_2=0.999, 96 | epsilon=1e-6, 97 | exclude_from_weight_decay=None, 98 | clipping_rate=1.0, 99 | name="AdaFactorOptimizer"): 100 | """Constructs a AdaFactorOptimizer.""" 101 | super(AdaFactorOptimizer, self).__init__(False, name) 102 | 103 | self.learning_rate = learning_rate 104 | self.weight_decay_rate = weight_decay_rate 105 | self.beta_1 = beta_1 106 | self.beta_2 = beta_2 107 | self.epsilon = epsilon 108 | self.epsilon1 = 1e-30 109 | self.epsilon2 = 0.001 110 | self.clipping_rate = clipping_rate 111 | self.exclude_from_weight_decay = exclude_from_weight_decay 112 | self.use_locking = False 113 | 114 | def _use_factored(self, shape): 115 | return len(shape) >= 2 116 | 117 | def _parameter_scale(self, var): 118 | """Estimate the scale of the parameters from the current values. 119 | We include a minimum value of 0.001 to give it a chance to escape 0 120 | if it was zero-initialized. 121 | Instead of using the value, we could impute the scale from the shape, 122 | as initializers do. 123 | Args: 124 | var: a variable or Tensor. 125 | Returns: 126 | a Scalar 127 | """ 128 | return tf.maximum(reduce_rms(var), self.epsilon2) 129 | 130 | def apply_gradients(self, grads_and_vars, global_step=None, name=None): 131 | """See base class.""" 132 | assignments = [] 133 | for (grad, param) in grads_and_vars: 134 | if grad is None or param is None: 135 | continue 136 | 137 | param_name = self._get_variable_name(param.name) 138 | shape_list = get_shape_list(param, expected_rank=[1, 2]) 139 | 140 | # decay_rate = 1 - tf.pow(tf.cast(tf.train.get_or_create_global_step(), tf.float32) + 1.0, -0.8) 141 | decay_rate = self.beta_2 142 | grad_squared = tf.square(grad) + self.epsilon1 143 | 144 | update_scale = self.learning_rate 145 | # update_scale = self.learning_rate * tf.cast(self._parameter_scale(param), dtype=tf.float32) 146 | 147 | # HACK: Make things dependent on grad. 148 | # This confounds the XLA rewriter and keeps it from fusing computations 149 | # across different variables. This fusion is a bad for HBM usage, since 150 | # it causes the gradients to persist in memory. 
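# (Editor's note) Concretely, the 1e-30 multipliers below leave decay_rate and
# update_scale numerically unchanged for any realistic gradient magnitude; their
# only purpose is to create a data dependency on this variable's gradient so that
# XLA keeps each variable's AdaFactor update as a separate computation.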
151 | grad_squared_mean = tf.reduce_mean(grad_squared) 152 | decay_rate += grad_squared_mean * 1e-30 153 | update_scale += grad_squared_mean * 1e-30 154 | 155 | # END HACK 156 | 157 | if self._use_factored(shape_list): 158 | num_rows, num_columns = shape_list 159 | 160 | vr = tf.get_variable( 161 | name=param_name + "/adafactor_vr", 162 | shape=[num_rows], 163 | dtype=tf.float32, 164 | trainable=False, 165 | initializer=tf.zeros_initializer()) 166 | vc = tf.get_variable( 167 | name=param_name + "/adafactor_vc", 168 | shape=[num_columns], 169 | dtype=tf.float32, 170 | trainable=False, 171 | initializer=tf.zeros_initializer()) 172 | 173 | next_vr = decay_rate * vr + (1 - decay_rate) * tf.reduce_mean(grad_squared, 1) 174 | next_vc = decay_rate * vc + (1 - decay_rate) * tf.reduce_mean(grad_squared, 0) 175 | 176 | long_term_mean = tf.reduce_mean(next_vr, -1, keepdims=True) 177 | r_factor = tf.rsqrt(next_vr / long_term_mean + self.epsilon1) 178 | c_factor = tf.rsqrt(next_vc + self.epsilon1) 179 | update = grad * tf.expand_dims(r_factor, -1) * tf.expand_dims(c_factor, -2) 180 | 181 | assignments.append(vr.assign(next_vr, use_locking=self.use_locking)) 182 | assignments.append(vc.assign(next_vc, use_locking=self.use_locking)) 183 | else: 184 | v = tf.get_variable( 185 | name=param_name + "/adafactor_v", 186 | shape=shape_list, 187 | dtype=tf.float32, 188 | trainable=False, 189 | initializer=tf.zeros_initializer()) 190 | next_v = decay_rate * v + (1 - decay_rate) * grad_squared 191 | 192 | assignments.append(v.assign(next_v, use_locking=self.use_locking)) 193 | update = grad * tf.rsqrt(next_v + self.epsilon1) 194 | 195 | clipping_denom = tf.maximum(1.0, reduce_rms(update) / self.clipping_rate) 196 | update /= clipping_denom 197 | 198 | # Do weight decay 199 | # Just adding the square of the weights to the loss function is *not* 200 | # the correct way of using L2 regularization/weight decay with Adam, 201 | # since that will interact with the m and v parameters in strange ways. 202 | # 203 | # Instead we want ot decay the weights in a manner that doesn't interact 204 | # with the m/v parameters. This is equivalent to adding the square 205 | # # of the weights to the loss with plain (non-momentum) SGD. 206 | if self._do_use_weight_decay(param_name): 207 | update += self.weight_decay_rate * param 208 | 209 | update_with_lr = update_scale * update 210 | next_param = param - update_with_lr 211 | 212 | assignments.append(param.assign(next_param, use_locking=self.use_locking)) 213 | return tf.group(*assignments, name=name) 214 | 215 | def _do_use_weight_decay(self, param_name): 216 | """Whether to use L2 weight decay for `param_name`.""" 217 | if not self.weight_decay_rate: 218 | return False 219 | if self.exclude_from_weight_decay: 220 | for r in self.exclude_from_weight_decay: 221 | if re.search(r, param_name) is not None: 222 | return False 223 | return True 224 | 225 | def _get_variable_name(self, param_name): 226 | """Get the variable name from the tensor name.""" 227 | m = re.match("^(.*):\\d+$", param_name) 228 | if m is not None: 229 | param_name = m.group(1) 230 | return param_name 231 | 232 | 233 | def reduce_rms(x): 234 | return tf.sqrt(tf.reduce_mean(tf.square(x))) 235 | -------------------------------------------------------------------------------- /realnews/prepare_lm_data.py: -------------------------------------------------------------------------------- 1 | """ 2 | Turn a merged corpus into tfrecord files. 3 | 4 | NOTE: You will want to do this using several processes. 
I did this on an AWS machine with 72 CPUs using GNU parallel 5 | as that's where I had the deduplicated RealNews dataset. 6 | """ 7 | import argparse 8 | import ujson as json 9 | from sample.encoder import get_encoder, tokenize_for_grover_training, detokenize, sliding_window, create_int_feature 10 | import random 11 | import tensorflow as tf 12 | import collections 13 | import os 14 | from tempfile import TemporaryDirectory 15 | 16 | parser = argparse.ArgumentParser(description='SCRAPE!') 17 | parser.add_argument( 18 | '-fold', 19 | dest='fold', 20 | default=0, 21 | type=int, 22 | help='which fold we are on' 23 | ) 24 | parser.add_argument( 25 | '-num_folds', 26 | dest='num_folds', 27 | default=1, 28 | type=int, 29 | help='Number of folds (corresponding to both the number of training files and the number of testing files)', 30 | ) 31 | parser.add_argument( 32 | '-seed', 33 | dest='seed', 34 | default=1337, 35 | type=int, 36 | help='which seed to use' 37 | ) 38 | parser.add_argument( 39 | '-base_fn', 40 | dest='base_fn', 41 | default='realnews_', 42 | type=str, 43 | help='We will output files that are like {base_fn}_{n}.tfrecord for n in 0, ..., 1023' 44 | ) 45 | 46 | parser.add_argument( 47 | '-input_fn', 48 | dest='input_fn', 49 | default='realnews.jsonl', 50 | type=str, 51 | help='Base filename to use. THIS MUST BE A LOCAL FILE.' 52 | ) 53 | parser.add_argument( 54 | '-max_seq_length', 55 | dest='max_seq_length', 56 | default=1024, 57 | type=int, 58 | help='Max sequence length', 59 | ) 60 | 61 | parser.add_argument( 62 | '-add_extra_articles_to_end', 63 | dest='add_extra_articles_to_end', 64 | type=bool, 65 | action='store_true', 66 | help='Whether to minimize padding by adding extra articles to the end', 67 | ) 68 | 69 | args = parser.parse_args() 70 | random.seed(args.seed + args.fold) 71 | 72 | encoder = get_encoder() 73 | 74 | 75 | class S3TFRecordWriter(object): 76 | def __init__(self, fn): 77 | self.fn = fn 78 | if fn.startswith('s3://'): 79 | from boto3.s3.transfer import TransferConfig 80 | import boto3 81 | self.gclient = None 82 | self.s3client = boto3.client('s3', 83 | ) 84 | self.storage_dir = TemporaryDirectory() 85 | self.writer = tf.python_io.TFRecordWriter(os.path.join(self.storage_dir.name, 'temp.tfrecord')) 86 | self.bucket_name, self.file_name = self.fn.split('s3://', 1)[1].split('/', 1) 87 | elif fn.startswith('gs://'): 88 | from google.cloud import storage 89 | self.s3client = None 90 | self.gclient = storage.Client() 91 | self.storage_dir = TemporaryDirectory() 92 | self.writer = tf.python_io.TFRecordWriter(os.path.join(self.storage_dir.name, 'temp.tfrecord')) 93 | self.bucket_name, self.file_name = self.fn.split('gs://', 1)[1].split('/', 1) 94 | 95 | else: 96 | self.s3client = None 97 | self.gclient = None 98 | self.bucket_name = None 99 | self.file_name = None 100 | self.storage_dir = None 101 | self.writer = tf.python_io.TFRecordWriter(fn) 102 | 103 | def write(self, x): 104 | self.writer.write(x) 105 | 106 | def close(self): 107 | self.writer.close() 108 | 109 | if self.s3client is not None: 110 | from boto3.s3.transfer import TransferConfig 111 | config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10, 112 | multipart_chunksize=1024 * 25, use_threads=True) 113 | self.s3client.upload_file( 114 | os.path.join(self.storage_dir.name, 'temp.tfrecord'), 115 | self.bucket_name, 116 | self.file_name, 117 | ExtraArgs={'ACL': 'public-read'}, Config=config, 118 | ) 119 | self.storage_dir.cleanup() 120 | if self.gclient is not None: 121 | bucket = 
self.gclient.get_bucket(self.bucket_name) 122 | blob = bucket.blob(self.file_name) 123 | blob.upload_from_filename(os.path.join(self.storage_dir.name, 'temp.tfrecord')) 124 | self.storage_dir.cleanup() 125 | 126 | def __enter__(self): 127 | # Called when entering "with" context. 128 | return self 129 | 130 | def __exit__(self, *_): 131 | # Called when exiting "with" context. 132 | # Upload shit 133 | print("CALLING CLOSE") 134 | self.close() 135 | 136 | 137 | def article_iterator(encoder, final_desired_size=1025): 138 | """ Iterate through the provided filename + tokenize""" 139 | assert os.path.exists(args.input_fn) 140 | with open(args.input_fn, 'r') as f: 141 | for l_no, l in enumerate(f): 142 | if l_no % args.num_folds == args.fold: 143 | article = json.loads(l) 144 | article['input_ids'] = tokenize_for_grover_training(encoder, article, desired_size=final_desired_size, 145 | unconditional_prob=.35) 146 | article['inst_index'] = (l_no // args.num_folds) 147 | if article['inst_index'] < 100: 148 | print('---\nINPUT{}. {}\n---\nTokens: {}\n'.format(article['inst_index'], 149 | detokenize(encoder, article['input_ids']), 150 | article['input_ids'] 151 | ), flush=True) 152 | if len(article['input_ids']) == 0: 153 | continue 154 | yield article 155 | 156 | 157 | def _stream_from_buffer(buffer, current_desired_size, pad_token=0, add_articles_to_end=False): 158 | """ Combines short articles that are in a buffer """ 159 | random.shuffle(buffer) 160 | i = 0 161 | while i < len(buffer): 162 | article = buffer[i] 163 | if add_articles_to_end: 164 | for article2add in buffer[(i + 1):]: 165 | i += 1 166 | article['input_ids'].append(encoder.padding) 167 | article['input_ids'].append(encoder.reset_context) 168 | article['input_ids'].extend(article2add['input_ids']) 169 | 170 | if len(article['input_ids']) >= current_desired_size: 171 | article['input_ids'] = article['input_ids'][:current_desired_size] 172 | break 173 | # print(f"YIELD FROM BUFFER {i}") 174 | 175 | # Pad to right length 176 | amount_to_pad = current_desired_size - len(article['input_ids']) 177 | article['input_ids'].extend([pad_token] * amount_to_pad) 178 | article['sub_index'] = 0 179 | yield article 180 | i += 1 181 | 182 | 183 | def buffered_and_sliding_window_article_iterator(encoder, current_desired_size, final_desired_size=1025): 184 | """ We apply a sliding window to fix long sequences, and use a buffer that combines short sequences.""" 185 | assert current_desired_size <= final_desired_size 186 | buffer = [] 187 | for article in article_iterator(encoder, final_desired_size=final_desired_size): 188 | amount_to_pad = current_desired_size - len(article['input_ids']) 189 | 190 | if article['split'] == 'val' or amount_to_pad <= 0: 191 | for sub_index, sub_article in enumerate(sliding_window(article, max_seq_length=current_desired_size, 192 | pad_token=encoder.padding)): 193 | sub_article['sub_index'] = sub_index 194 | # print(f"AMT2PAD < 0 YIELD-{inst_index} sliding window {sub_index}", flush=True) 195 | yield sub_article 196 | else: 197 | # Buffer time. 
198 | buffer.append(article) 199 | 200 | if len(buffer) % 100 == 0: 201 | yield from _stream_from_buffer(buffer, 202 | current_desired_size=current_desired_size, 203 | pad_token=encoder.padding, 204 | add_articles_to_end=args.add_extra_articles_to_end) 205 | buffer = [] 206 | yield from _stream_from_buffer(buffer, 207 | current_desired_size=current_desired_size, 208 | pad_token=encoder.padding, 209 | add_articles_to_end=args.add_extra_articles_to_end) 210 | 211 | 212 | # OK now write the tfrecord file 213 | total_written = 0 214 | train_file = args.base_fn + 'train{:04d}.tfrecord'.format(args.fold) 215 | val_file = args.base_fn + 'val{:04d}.tfrecord'.format(args.fold) 216 | with S3TFRecordWriter(train_file) as train_writer, S3TFRecordWriter(val_file) as val_writer: 217 | for article in buffered_and_sliding_window_article_iterator(encoder, current_desired_size=args.max_seq_length + 1, 218 | final_desired_size=max(args.max_seq_length + 1, 1025)): 219 | writer2use = train_writer if article['split'] == 'train' else val_writer 220 | assert len(article['input_ids']) == (args.max_seq_length + 1) 221 | 222 | features = collections.OrderedDict() 223 | features["input_ids"] = create_int_feature(article['input_ids']) 224 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 225 | 226 | writer2use.write(tf_example.SerializeToString()) 227 | total_written += 1 228 | 229 | # DEBUG 230 | if article['inst_index'] < 5: 231 | print("~~~\nSubindex{}. Index {}. ARTICLE: {}\n---\nTokens: {}\n\n".format(article['sub_index'], 232 | article['inst_index'], 233 | detokenize(encoder, 234 | article['input_ids']), 235 | article['input_ids']), 236 | flush=True) 237 | if article['inst_index'] % 1000 == 0: 238 | print("{} articles, {} written".format(article['inst_index'], total_written), flush=True) 239 | print("DONE UPLOADING", flush=True) 240 | -------------------------------------------------------------------------------- /realnews/process_ccrawl.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | import re 5 | from tempfile import TemporaryFile, NamedTemporaryFile 6 | from urllib.parse import urlparse 7 | 8 | import boto3 9 | import newspaper 10 | import tldextract 11 | from tqdm import tqdm 12 | from warcio import ArchiveIterator 13 | 14 | with open(os.path.join(os.path.dirname(__file__), 'domain_to_allowed_subdomains.json'), 'r') as f: 15 | ALLOWED_SUBDOMAINS = json.load(f) 16 | 17 | # FOR HANNAH 18 | PROPAGANDA_SUBDOMAINS = {'wnd.com': True, 'infowars.com': True, 'breitbart.com': True, 'dailycaller.com': True, 19 | 'yournewswire.com': True, 'prageru.com': True, 'newsmax.com': True, 'twitchy.com': True, 20 | 'dailywire.com': True, 'dailysignal.com': True, 'bigleaguepolitics.com': True, 21 | 'redstate.com': True, 'townhall.com': True, 'bients.com': True, 'thegatewaypundit.com': True, 22 | 'nationalreport.net': True, 'naturalnews.com': True, 'prntly.com': True, 23 | 'worldnewsdailyreport.com': True, 24 | 'libertywriters.com': True, 'globalresearch.ca': True, 25 | } 26 | 27 | BANNED_EXTENSIONS = {'png', 'jpg', 'jpeg', 'gif', 'php', 'css', 'ico', 'xml', 'woff', 'swf', 'jpg', 'svg', 'ttf', 'tif', 28 | 'bmp', 'js', 'pdf', 'amp', 'rss', 'mp3', 'eot', 'jsp', 'woff2', 'json', 'com', 'axd', 'php3', 29 | 'bin', 'mp4', 'img', 'xhtml', 'dll', 'm4v', 'vov', 'phtml', 'flv', 'pl', 'jpe', 'otf', 'php\'', 30 | 'wmv', 'wav', 'xls', 'doc', 'photo', 'gallery', 'bg', 'ece', 'feed', 'xmlhttp', 'video', 'eml', 31 | 'xnf', 
'prt', 'docx', 'file', 'vpx', 'cur', 'data', 'jhtml', 'xlsx', 'map', 'fb', 'webp', 'ppt', 32 | 'rdf', 'bio', 'exe', 'jar', 'net', 'open', 'ogg', 'wma', '7u', 'res', 'dwr', 'pjpeg', 'gz', 'ajax', 33 | 'psd', 'zip', 'coffee', 'tabs', 'cls', 'step', 'jp'} 34 | 35 | BANNED_STRINGS = ['slideshow.', 36 | 'slideshowImage', 'associatedcontent.com', 37 | '/videoid/', 'sodahead.com', 'b92.net', 38 | 'isna.ir', 'prnewswire.com', 'slashdot.org', 'suite101.com', 'tv.com', 'news.yahoo.com', 39 | '/video/', '/image/', 'bbb.org', 'yle.fi', 'ImageId', 'slideshow_files', '/slideshows/', 40 | '/videos/', '/video-', '/videoid/', '/wp-json/', '/search/', 'videoID=', '/portableplayer/', 41 | 'video.aspx', '/allvideo/', 'width=', 'height=', '/PhotoGallery/', 'ArticleSlideshowServlet', 42 | '/storyimage/', '/image.html', '/photos/', '.jpeg', '.jpg', '/em_image', 'maxw=', 'maxh=', 43 | '/flashplayers/', '/apps/', '/gallery/', 'photogallery', 'imageViewer', '.jpg', 'img=', 44 | '/forums/', '/users/', '/tags/', '/audio/', '/resources/', '/metrics/', '/images/', '/products/', 45 | 'com.pe', '/agencia/', '/resizer/', '/user?', '/tag/', '/bookmark/', '/plugins/', '/blogs/', 46 | '/advertising/', 'blockbuster.co.uk', '/oembed/', '/needlogin', 'type=login', '/mailto/', '/feed', 47 | 'sendtofriend.aspx', '/ajax/', 'bloggernews.net', '/topics/', 'view_gallery', '/event.asp', '/forum/', 48 | '/posts/', '/cgi-bin/', '/member/', 'news_tool_v2.cfm', '/database/', '/Default.aspx', 49 | '/Search/', '/Slideshow/', '/slideshow/', '/user/', '/register/', '/donate/', '/calendar/', 50 | 'send-to-friend', 51 | '/enter/', '/photo-gallery/', '/news_email.asp', '/Flash.aspx', '/findlocal/', '/ads/', '/reply/', 52 | '/events/', '/picture-gallery/', '/slideshow?', '/Mozilla/', '/sendtoafriend.asp', '/blog/', 53 | '/mailStory/', 'admin.asp?', '.ads/', '/used_cars/' 54 | ] 55 | BANNED_STRINGS = [re.escape(x) for x in BANNED_STRINGS] 56 | is_banned_regex = re.compile(r'(' + r'|'.join(BANNED_STRINGS) + r')') 57 | 58 | 59 | def _url_seems_ok(url, domain_to_allowed_subdomains): 60 | """ 61 | Check if the URL seems ok. if it does then we'll return a tuple of 62 | CLEAN URL, main domain. 63 | :param url: 64 | :return: 65 | """ 66 | # Long URLs are usually bad 67 | if len(url) > 200: 68 | return False 69 | 70 | # FIRST check if the domain is OK 71 | ext = tldextract.extract(url) 72 | main_domain = ext.domain + '.' + ext.suffix 73 | allowed_subdomains = domain_to_allowed_subdomains.get(main_domain, None) 74 | if allowed_subdomains is None: 75 | return False 76 | 77 | if isinstance(allowed_subdomains, list) and not ext.subdomain in allowed_subdomains: 78 | return False 79 | 80 | # Check for banned extensios 81 | parsed = urlparse(url) 82 | parsed = parsed._replace(query="", fragment="") 83 | path_to_use = parsed.path 84 | file_extension = os.path.splitext(path_to_use)[1] 85 | if file_extension in BANNED_EXTENSIONS: 86 | return False 87 | 88 | # If there are two dotcoms then that's probably bad! 
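# (Editor's note, hypothetical URL) e.g. "https://example.com/out?u=http://tracker.net/x"
# contains both ".com" and ".net", so the count below is 2 and the URL is rejected.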
89 | endings = len(re.findall(r'(\.com|\.co\.uk|\.net|\.org)', url)) 90 | if endings > 1: 91 | return False 92 | 93 | # Check for banned words 94 | if not (is_banned_regex.search(url) is None): 95 | return False 96 | 97 | # AT A LATER DATE: we need to check if the URL was banned 98 | return (parsed.geturl(), main_domain) 99 | 100 | 101 | def _filter_excessive_newlines(text): 102 | return re.sub(r'\n\s+', r'\n', text) 103 | 104 | 105 | class Article(object): 106 | """ NEWSPAPER VERSION """ 107 | 108 | def __init__(self, html): 109 | self.html = html if html is not None else "" 110 | 111 | self.dummy_article = newspaper.Article(url='', fetch_images=False, verbose=True) 112 | self.dummy_article.set_html(html) 113 | self.dummy_article.parse() 114 | 115 | self.text = _filter_excessive_newlines(self.dummy_article.text) 116 | self.authors = self.dummy_article.authors 117 | self.authors = [x for x in self.authors if len(x.split(' ')) < 10] 118 | self.title = self.dummy_article.title 119 | 120 | # sometimes the text started with the title... that's bad 121 | if self.text.startswith(self.title + '\n'): 122 | self.text = self.text[len(self.title):].lstrip('\n') 123 | 124 | if self.dummy_article.publish_date and not isinstance(self.dummy_article.publish_date, str): 125 | try: 126 | self.publish_date = self.dummy_article.publish_date.date().strftime( 127 | "%m-%d-%Y") 128 | except AttributeError: 129 | self.publish_date = None 130 | else: 131 | self.publish_date = None 132 | 133 | self._extract_summary() 134 | 135 | def _extract_summary(self): 136 | self.summary = None 137 | for good2bad in [('og', 'description'), ('twitter', 'description'), ('description',)]: 138 | curr_dict = self.dummy_article.meta_data 139 | for key in good2bad[:-1]: 140 | curr_dict = curr_dict.get(key, {}) 141 | summary = curr_dict.get(good2bad[-1], '').strip() 142 | 143 | if len(summary) > 30: 144 | self.summary = summary 145 | return 146 | 147 | def num_empty_fields(self): 148 | num_empty = 0 149 | for k, v in self.serialize().items(): 150 | if not v: 151 | num_empty += 1 152 | return num_empty 153 | 154 | def serialize(self): 155 | """ 156 | Return simple page object to JSONify and write to file. 
157 | """ 158 | return { 159 | 'meta_lang': self.dummy_article.meta_lang, 160 | 'title': self.title, 161 | 'text': self.text, 162 | 'summary': self.summary, 163 | 'authors': self.authors, 164 | 'publish_date': self.publish_date 165 | } 166 | 167 | def __repr__(self): 168 | return str(self.serialize()) 169 | 170 | 171 | def parse_record(record, propaganda=False): 172 | if record.rec_type != 'response': 173 | return 174 | if record.content_type != 'application/http; msgtype=response': 175 | return 176 | 177 | url_was_ok = _url_seems_ok(record.rec_headers['WARC-Target-URI'], 178 | domain_to_allowed_subdomains=PROPAGANDA_SUBDOMAINS if propaganda else ALLOWED_SUBDOMAINS) 179 | if not url_was_ok: 180 | return 181 | 182 | url, domain = url_was_ok 183 | 184 | try: 185 | html = record.content_stream().read().decode('utf-8') 186 | except UnicodeDecodeError: 187 | # yield {'status': 'fail', 'url': url, 'reason': 'parse'} 188 | return 189 | 190 | if not html: 191 | # yield {'status': 'fail', 'url': url, 'reason': 'parse'} 192 | return 193 | 194 | try: 195 | article = Article(html).serialize() 196 | except ValueError: 197 | # yield {'status': 'fail', 'url': url, 'reason': 'parse'} 198 | return 199 | 200 | # Check if is good 201 | if article['publish_date'] is None: 202 | # yield {'status': 'fail', 'url': url, 'reason': 'date'} 203 | return 204 | if len(article['text']) < 1000: 205 | # yield {'status': 'fail', 'url': url, 'reason': 'len'} 206 | return 207 | if len(article['title']) < 30: 208 | # yield {'status': 'fail', 'url': url, 'reason': 'title'} 209 | return 210 | 211 | if article.pop('meta_lang') != 'en': 212 | # yield {'status': 'fail', 'url': url, 'reason': 'lang'} 213 | return 214 | 215 | article['status'] = 'success' 216 | article['url'] = url 217 | article['domain'] = domain 218 | article['warc_date'] = record.rec_headers['WARC-Date'] 219 | yield article 220 | 221 | 222 | # NOTE: You might have to put in your credentials here, like 223 | # s3client = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key) 224 | s3client = boto3.client('s3') 225 | 226 | parser = argparse.ArgumentParser() 227 | parser.add_argument('-path', type=str, 228 | default='crawl-data/CC-MAIN-2017-13/segments/1490218186353.38/warc/CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.gz', 229 | help='in path') 230 | parser.add_argument('-bucket_name', type=str, 231 | help='out path') 232 | parser.add_argument('-propaganda', action='store_true', 233 | help='Download some propaganda instead of real news') 234 | args = parser.parse_args() 235 | 236 | archive_date = args.path.split('/')[1] 237 | rest = '_'.join(args.path.split('/')[2:]) 238 | out_prefix = 'propaganda-' if args.propaganda else '' 239 | 240 | out_key = '{}{}/{}.jsonl'.format(out_prefix, args.path.split('/')[1], rest) 241 | 242 | with TemporaryFile(mode='w+b', dir='/home/ubuntu/temp/') as warctemp: 243 | s3client.download_fileobj('commoncrawl', args.path, warctemp) 244 | warctemp.seek(0) 245 | 246 | with NamedTemporaryFile(mode='w', dir='/home/ubuntu/temp/') as f: 247 | for record in tqdm(ArchiveIterator(warctemp, no_record_parse=False)): 248 | for parsed_record in parse_record(record, propaganda=args.propaganda): 249 | f.write(json.dumps(parsed_record) + '\n') 250 | 251 | s3client.upload_file(f.name, args.bucket_name, out_key) 252 | 253 | print("I guess I'm done now") 254 | -------------------------------------------------------------------------------- /realnews/dedupe_crawl.py: 
-------------------------------------------------------------------------------- 1 | import boto3 2 | import ujson as json 3 | from tqdm import tqdm 4 | from concurrent.futures import ThreadPoolExecutor 5 | from tempfile import NamedTemporaryFile, TemporaryDirectory 6 | from pybloof import StringBloomFilter 7 | from collections import defaultdict 8 | import random 9 | import os 10 | import re 11 | from boto3.s3.transfer import TransferConfig 12 | import pandas as pd 13 | 14 | # this is hella usefl https://krisives.github.io/bloom-calculator/ 15 | has_seen_url = StringBloomFilter(size=14440984416, hashes=10) 16 | has_seen_content_start = StringBloomFilter(size=14440984416, hashes=10) 17 | # has_seen_content_end = StringBloomFilter(size=14440984416, hashes=10) 18 | 19 | s3client = boto3.client('s3') 20 | 21 | DUMP_ORDER = [ 22 | 'CC-MAIN-2016-50', 23 | 'CC-MAIN-2017-04', 24 | 'CC-MAIN-2017-09', 25 | 'CC-MAIN-2017-13', 26 | 'CC-MAIN-2017-17', 27 | 'CC-MAIN-2017-22', 28 | 'CC-MAIN-2017-26', 29 | 'CC-MAIN-2017-30', 30 | 'CC-MAIN-2017-34', 31 | 'CC-MAIN-2017-39', 32 | 'CC-MAIN-2017-43', 33 | 'CC-MAIN-2017-47', 34 | 'CC-MAIN-2017-51', 35 | 'CC-MAIN-2018-05', 36 | 'CC-MAIN-2018-09', 37 | 'CC-MAIN-2018-13', 38 | 'CC-MAIN-2018-17', 39 | 'CC-MAIN-2018-22', 40 | 'CC-MAIN-2018-26', 41 | 'CC-MAIN-2018-30', 42 | 'CC-MAIN-2018-34', 43 | 'CC-MAIN-2018-39', 44 | 'CC-MAIN-2018-43', 45 | 'CC-MAIN-2018-47', 46 | 'CC-MAIN-2018-51', 47 | 'CC-MAIN-2019-04', 48 | 'CC-MAIN-2019-09', 49 | 'CC-MAIN-2019-13', 50 | ][::-1] 51 | 52 | TRAIN_PORTION = 0.95 53 | CONTENT_LENGTH = 100 54 | 55 | 56 | def _get_split(domain): 57 | """ You could do this by domain, or not""" 58 | if random.random() < TRAIN_PORTION: 59 | return 'train' 60 | return 'val' 61 | 62 | 63 | def get_matching_s3_objects(bucket, prefix='', suffix=''): 64 | """ 65 | Generate objects in an S3 bucket. 66 | THANK YOU https://alexwlchan.net/2018/01/listing-s3-keys-redux/ 67 | 68 | :param bucket: Name of the S3 bucket. 69 | :param prefix: Only fetch objects whose key starts with 70 | this prefix (optional). 71 | :param suffix: Only fetch objects whose keys end with 72 | this suffix (optional). 73 | """ 74 | kwargs = {'Bucket': bucket} 75 | 76 | # If the prefix is a single string (not a tuple of strings), we can 77 | # do the filtering directly in the S3 API. 78 | if isinstance(prefix, str): 79 | kwargs['Prefix'] = prefix 80 | 81 | while True: 82 | 83 | # The S3 API response is a large blob of metadata. 84 | # 'Contents' contains information about the listed objects. 85 | resp = s3client.list_objects_v2(**kwargs) 86 | 87 | try: 88 | contents = resp['Contents'] 89 | except KeyError: 90 | return 91 | 92 | for obj in contents: 93 | key = obj['Key'] 94 | if key.startswith(prefix) and key.endswith(suffix): 95 | yield obj['Key'''] 96 | 97 | # The S3 API is paginated, returning up to 1000 keys at a time. 98 | # Pass the continuation token into the next response, until we 99 | # reach the final page (when this field is missing). 
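# (Editor's aside, illustrative only) Elsewhere in this file the generator is
# consumed roughly as
#     for obj_key in get_matching_s3_objects('periodista', prefix=cc_name):
#         ...
# with cc_name drawn from DUMP_ORDER above.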
100 | try: 101 | kwargs['ContinuationToken'] = resp['NextContinuationToken'] 102 | except KeyError: 103 | break 104 | 105 | 106 | def iterate_over_batches(stream, batch_size=64): 107 | buffer = [] 108 | for x in stream: 109 | buffer.append(x) 110 | if len(buffer) >= batch_size: 111 | yield buffer 112 | buffer = [] 113 | if len(buffer) > 0: 114 | yield buffer 115 | 116 | def _could_be_author(author): 117 | author_lower = author.lower().strip() 118 | if author_lower.startswith(('https', 'www.', 'min read')): 119 | return False 120 | if '.com' in author_lower: 121 | return False 122 | if author_lower in {'arts', 'politics', 'sports', 'january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december'}: 123 | return False 124 | return True 125 | 126 | def _fix_notfound_authors(article): 127 | """ 128 | # An extra preprocessing step: if author list is empty and article starts with By then let's fix things. 129 | :param article: 130 | :return: 131 | """ 132 | if len(article['authors']) == 0 and article['text'].startswith('By ') and '\n' in article: 133 | possible_authors, text = article['text'][3:].split('\n', maxsplit=1) 134 | if len(possible_authors.split(' ')) < 6: 135 | article['authors'] = [possible_authors.strip()] 136 | article['text'] = text.strip() 137 | 138 | article['authors'] = [x for x in article['authors'] if _could_be_author(x)] 139 | 140 | # Those aren't summaries 141 | if article['summary'] is not None and article['summary'].endswith(('...','…')): 142 | article['summary'] = None 143 | 144 | 145 | def _fix_photos(article): 146 | article['text'] += '\n' 147 | article['text'] = re.sub(r'(Facebook Twitter Pinterest |ADVERTISEMENT ADVERTISEMENT|ADVERTISEMENT Thanks for watching! Visit Website)', '', article['text']) 148 | article['text'] = re.sub(r'\nAdvertisement\s+Advertisement\n', '\n', article['text']) 149 | 150 | article['text'] = re.sub(r'\((Photo|Image|Source|Photograph): .{1, 60}\)', '', article['text']) 151 | article['text'] = re.sub(r'\n(Photo|Image|Source|Photograph): .{1, 60}\n', '\n', article['text']) 152 | article['text'] = re.sub(r'\nPhoto Published on .{1, 60}\n', '\n', article['text']) 153 | 154 | article['text'] = re.sub(r'\.\s+(Photo|Image): .{1, 60}\n', '.\n', article['text']) 155 | article['text'] = re.sub(r'\nPicture Courtesy: .{1, 60}\n', '\n', article['text']) 156 | article['text'] = re.sub(r'\n(\[Related:|RELATED|READ MORE:|PHOTOS:|SEE ALSO:|Also On News|MORE:) .{1, 120}\n', '\n', article['text']) 157 | article['text'] = re.sub(r'Share this: Facebook\nTwitter\nGoogle\nWhatsApp\nEmail\nCopy\n', '\n', article['text']) 158 | 159 | 160 | article['text'] = re.sub(r'\n+', '\n', article['text']) 161 | # article['text'] = re.sub(r'http.+\b', '', article['text']) 162 | article['text'].strip() 163 | 164 | 165 | 166 | # Forbes often has these duplications 167 | if article['domain'] == 'forbes.com': 168 | for company_name in ['Apple', 'Microsoft', 'Google', 'Amazon', 'Chase', 'Citigroup', 'Comcast', 169 | 'Cisco', 'Disney', 'Facebook', 'Intel', 'Netflix', 'Nike', 'Starbucks', 'NVIDIA', 170 | 'Raytheon', 'Visa', 'Verizon', 'ExxonMobil']: 171 | article['text'] = article['text'].replace(f'{company_name} {company_name}', f'{company_name}') 172 | 173 | 174 | class Fetcher(object): 175 | def __init__( 176 | self, 177 | workers=8, 178 | ): 179 | self.workers = workers 180 | 181 | def download(self, obj_key_batch): 182 | """ 183 | Download a thing. 
184 | """ 185 | with ThreadPoolExecutor(self.workers) as executor: 186 | yield from executor.map(self._thread, obj_key_batch) 187 | 188 | def _thread(self, obj_key): 189 | article_list = [] 190 | 191 | with NamedTemporaryFile(mode='w+b', dir='/home/ubuntu/temp2/') as packet_temp: 192 | s3client.download_fileobj('periodista', obj_key, packet_temp) 193 | packet_temp.seek(0) 194 | 195 | with open(packet_temp.name, 'r') as fin: 196 | for l in fin: 197 | article = json.loads(l) 198 | 199 | # Preprocessing could go here 200 | _fix_notfound_authors(article) 201 | _fix_photos(article) 202 | article_list.append(article) 203 | return article_list 204 | 205 | 206 | def fast_article_iterator(cc_name, batch_size=256): 207 | for obj_key_batch in tqdm(iterate_over_batches(get_matching_s3_objects('periodista', prefix=cc_name), 208 | batch_size=batch_size), total=64000 // batch_size): 209 | fetcher = Fetcher(workers=16) 210 | for article_list in fetcher.download(obj_key_batch): 211 | for article in article_list: 212 | yield article 213 | 214 | 215 | def _is_definitely_unique(article): 216 | # CERTAIN THINGS ALWAYS NEED TO BE BANNED 217 | if len(re.findall(r'Image \d+ of \d+', article['text'])) > 2: 218 | return False 219 | 220 | if ' '.join(article['authors']) == 'News Traffic Weather': 221 | return False 222 | 223 | if article['url'] in has_seen_url: 224 | return False 225 | 226 | if article['text'][:CONTENT_LENGTH] in has_seen_content_start: 227 | return False 228 | 229 | has_seen_url.add(article['url']) 230 | has_seen_content_start.add(article['text'][:CONTENT_LENGTH]) 231 | return True 232 | 233 | 234 | def _get_mini_sample(num_to_return=1000): 235 | articles = [] 236 | hits = 0 237 | misses = 0 238 | domain2count = defaultdict(int) 239 | for article in fast_article_iterator(DUMP_ORDER[0]): 240 | if _is_definitely_unique(article): 241 | domain2count[article['domain']] += 1 242 | articles.append(article) 243 | hits += 1 244 | else: 245 | misses += 1 246 | if (hits + misses) % 100000 == 0: 247 | print(f"{hits} hits and {misses} misses", flush=True) 248 | 249 | if len(articles) > (num_to_return * 1000): 250 | break 251 | random.shuffle(articles) 252 | return articles[:num_to_return], dict(domain2count) 253 | 254 | 255 | def upload_to_s3(in_fn, out_fn): 256 | config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10, 257 | multipart_chunksize=1024 * 25, use_threads=True) 258 | s3client.upload_file(in_fn, 'periodista', out_fn, 259 | ExtraArgs={'ACL': 'public-read'}, 260 | Config=config, 261 | ) 262 | 263 | 264 | def _iterate_through_archivedotorg(bucket_name): 265 | with NamedTemporaryFile(mode='w+b', dir='/home/ubuntu/temp2/') as packet_temp: 266 | s3client.download_fileobj(bucket_name, 'archivedotorg.jsonl', packet_temp) 267 | packet_temp.seek(0) 268 | 269 | with open(packet_temp.name, 'r') as fin: 270 | for l in fin: 271 | article = json.loads(l) 272 | article['split'] = _get_split(article['domain']) 273 | if article['split'] == 'ignore': 274 | article['split'] = 'train' 275 | 276 | # Preprocessing could go here 277 | _fix_notfound_authors(article) 278 | _fix_photos(article) 279 | if _is_definitely_unique(article): 280 | yield article 281 | 282 | 283 | if __name__ == '__main__': 284 | # Iterate through and also get the archive.org scrape 285 | hits = 0 286 | misses = 0 287 | domain2count = defaultdict(int) 288 | 289 | BUCKET_NAME = "MYBUCKETNAME" 290 | 291 | with open('/home/ubuntu/temp2/news.jsonl', 'w') as f: 292 | # First get the archive.org scrape, which already is going to handle 
deduplication /etc 293 | for article in _iterate_through_archivedotorg(BUCKET_NAME): 294 | domain2count[article['domain']] += 1 295 | f.write(json.dumps(article) + '\n') 296 | hits += 1 297 | if hits % 1000 == 0: 298 | print(article, flush=True) 299 | 300 | print("Got {} from archive.org".format(hits)) 301 | 302 | for cc_name in DUMP_ORDER: 303 | for article in fast_article_iterator(cc_name): 304 | if _is_definitely_unique(article): 305 | domain2count[article['domain']] += 1 306 | 307 | article['split'] = _get_split(article['domain']) 308 | if article['split'] != 'ignore': 309 | f.write(json.dumps(article) + '\n') 310 | hits += 1 311 | if hits % 100000 == 0: 312 | print(article, flush=True) 313 | else: 314 | misses += 1 315 | if (hits + misses) % 100000 == 0: 316 | print(f"{hits} hits and {misses} misses", flush=True) 317 | 318 | upload_to_s3('/home/ubuntu/temp2/news.jsonl', out_fn='news-apr-15-2019.jsonl') 319 | with NamedTemporaryFile(mode='w', dir='/home/ubuntu/temp2/') as out_tmp: 320 | json.dump(dict(domain2count), out_tmp) 321 | s3client.upload_file(out_tmp.name, BUCKET_NAME, 'domain2count.json', ExtraArgs={'ACL': 'public-read'}) -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. 
For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright 2019 Rowan Zellers 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | 203 | -------------------- 204 | 205 | Some functions in `sample/` come from OpenAI, with modifications; their license is below. 206 | 207 | MIT License 208 | 209 | Copyright (c) 2019 OpenAI 210 | 211 | Permission is hereby granted, free of charge, to any person obtaining a copy 212 | of this software and associated documentation files (the "Software"), to deal 213 | in the Software without restriction, including without limitation the rights 214 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 215 | copies of the Software, and to permit persons to whom the Software is 216 | furnished to do so, subject to the following conditions: 217 | 218 | The above copyright notice and this permission notice shall be included in all 219 | copies or substantial portions of the Software. 220 | 221 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 222 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 223 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 224 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 225 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 226 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 227 | SOFTWARE. -------------------------------------------------------------------------------- /discrimination/run_discrimination.py: -------------------------------------------------------------------------------- 1 | # Original work Copyright 2018 The Google AI Language Team Authors. 2 | # Modified work Copyright 2019 Rowan Zellers 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | """ 17 | For discrimination finetuning (i.e., deciding whether a generation is human- or Grover-written) 18 | """ 19 | import json 20 | import os 21 | 22 | import numpy as np 23 | import tensorflow as tf 24 | from tensorflow.python.lib.io import file_io 25 | 26 | from lm.dataloader import classification_convert_examples_to_features, classification_input_fn_builder 27 | from lm.modeling import classification_model_fn_builder, GroverConfig 28 | from lm.utils import _save_np 29 | from sample.encoder import get_encoder 30 | 31 | flags = tf.flags 32 | 33 | FLAGS = flags.FLAGS 34 | 35 | ## Required parameters 36 | flags.DEFINE_string( 37 | "config_file", 'configs/base.json', 38 | "The config json file corresponding to the pre-trained news model. " 39 | "This specifies the model architecture.") 40 | 41 | flags.DEFINE_string( 42 | "input_data", None, 43 | "Path to the input JSONL file of labeled examples; each line must include 'label' and 'split' fields.") 44 | 45 | flags.DEFINE_string( 46 | "additional_data", None, 47 | "Should we provide additional input data? 
maybe.") 48 | 49 | flags.DEFINE_string( 50 | "output_dir", None, 51 | "The output directory where the model checkpoints will be written.") 52 | 53 | ## Other parameters 54 | flags.DEFINE_string( 55 | "init_checkpoint", None, 56 | "Initial checkpoint (usually from a pre-trained model).") 57 | 58 | flags.DEFINE_integer( 59 | "max_seq_length", 1024, 60 | "The maximum total input sequence length after WordPiece tokenization. " 61 | "Sequences longer than this will be truncated, and sequences shorter " 62 | "than this will be padded. Must match data generation.") 63 | 64 | flags.DEFINE_integer("iterations_per_loop", 1000, 65 | "How many steps to make in each estimator call.") 66 | 67 | flags.DEFINE_integer("batch_size", 32, "Batch size used") 68 | 69 | flags.DEFINE_integer("max_training_examples", -1, "if you wanna limit the number") 70 | 71 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 72 | 73 | flags.DEFINE_bool("predict_val", False, "Whether to run eval on the dev set.") 74 | 75 | flags.DEFINE_bool( 76 | "predict_test", False, 77 | "Whether to run the model in inference mode on the test set.") 78 | 79 | flags.DEFINE_float("num_train_epochs", 3.0, 80 | "Total number of training epochs to perform.") 81 | 82 | flags.DEFINE_float( 83 | "warmup_proportion", 0.1, 84 | "Proportion of training to perform linear learning rate warmup for. " 85 | "E.g., 0.1 = 10% of training.") 86 | 87 | flags.DEFINE_bool("adafactor", False, "Whether to run adafactor") 88 | 89 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 90 | 91 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 92 | 93 | flags.DEFINE_string( 94 | "tpu_name", None, 95 | "The Cloud TPU to use for training. This should be either the name " 96 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 97 | "url.") 98 | 99 | flags.DEFINE_string( 100 | "tpu_zone", None, 101 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 102 | "specified, we will attempt to automatically detect the GCE project from " 103 | "metadata.") 104 | 105 | flags.DEFINE_string( 106 | "gcp_project", None, 107 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 108 | "specified, we will attempt to automatically detect the GCE project from " 109 | "metadata.") 110 | 111 | flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 112 | 113 | flags.DEFINE_integer( 114 | "num_tpu_cores", 8, 115 | "Only used if `use_tpu` is True. 
Total number of TPU cores to use.") 116 | 117 | 118 | def _flatten_and_tokenize_metadata(encoder, item): 119 | """ 120 | Turn the article into tokens 121 | :param item: Contains things that need to be tokenized 122 | 123 | fields are ['domain', 'date', 'authors', 'title', 'article', 'summary'] 124 | :return: dict 125 | """ 126 | metadata = [] 127 | for key in ['domain', 'date', 'authors', 'title', 'article']: 128 | val = item.get(key, None) 129 | if val is not None: 130 | metadata.append(encoder.__dict__[f'begin_{key}']) 131 | metadata.extend(encoder.encode(val)) 132 | metadata.append(encoder.__dict__[f'end_{key}']) 133 | return metadata 134 | 135 | 136 | def main(_): 137 | LABEL_LIST = ['machine', 'human'] 138 | LABEL_INV_MAP = {label: i for i, label in enumerate(LABEL_LIST)} 139 | 140 | tf.logging.set_verbosity(tf.logging.INFO) 141 | 142 | # These lines of code are just to check if we've already saved something into the directory 143 | if tf.gfile.Exists(FLAGS.output_dir): 144 | print(f"The output directory {FLAGS.output_dir} exists!") 145 | if FLAGS.do_train: 146 | print("EXITING BECAUSE DO_TRAIN is true", flush=True) 147 | return 148 | for split in ['val', 'test']: 149 | if tf.gfile.Exists(os.path.join(FLAGS.output_dir, f'{split}-probs.npy')) and getattr(FLAGS, 150 | f'predict_{split}'): 151 | print(f"EXITING BECAUSE {split}-probs.npy exists", flush=True) 152 | return 153 | # Double check to see if it has trained! 154 | if not tf.gfile.Exists(os.path.join(FLAGS.output_dir, 'checkpoint')): 155 | print("EXITING BECAUSE NO CHECKPOINT.", flush=True) 156 | return 157 | stuff = {} 158 | with tf.gfile.Open(os.path.join(FLAGS.output_dir, 'checkpoint'), 'r') as f: 159 | # model_checkpoint_path: "model.ckpt-0" 160 | # all_model_checkpoint_paths: "model.ckpt-0" 161 | for l in f: 162 | key, val = l.strip().split(': ', 1) 163 | stuff[key] = val.strip('"') 164 | if stuff['model_checkpoint_path'] == 'model.ckpt-0': 165 | print("EXITING BECAUSE IT LOOKS LIKE NOTHING TRAINED", flush=True) 166 | return 167 | 168 | 169 | elif not FLAGS.do_train: 170 | print("EXITING BECAUSE DO_TRAIN IS FALSE AND PATH DOESNT EXIST") 171 | return 172 | else: 173 | tf.gfile.MakeDirs(FLAGS.output_dir) 174 | 175 | news_config = GroverConfig.from_json_file(FLAGS.config_file) 176 | 177 | # TODO might have to change this 178 | encoder = get_encoder() 179 | examples = {'train': [], 'val': [], 'test': []} 180 | np.random.seed(123456) 181 | tf.logging.info("*** Parsing files ***") 182 | with tf.gfile.Open(FLAGS.input_data, "r") as f: 183 | for l in f: 184 | item = json.loads(l) 185 | 186 | # This little hack is because we don't want to tokenize the article twice 187 | context_ids = _flatten_and_tokenize_metadata(encoder=encoder, item=item) 188 | examples[item['split']].append({ 189 | 'info': item, 190 | 'ids': context_ids, 191 | 'label': item['label'], 192 | }) 193 | assert item['label'] in LABEL_INV_MAP 194 | 195 | additional_data = {'machine': [], 'human': []} 196 | if FLAGS.additional_data is not None: 197 | print("NOW WERE LOOKING AT ADDITIONAL INPUT DATA", flush=True) 198 | with tf.gfile.Open(FLAGS.additional_data, "r") as f: 199 | for l in f: 200 | item = json.loads(l) 201 | # This little hack is because we don't want to tokenize the article twice 202 | context_ids = _flatten_and_tokenize_metadata(encoder=encoder, item=item) 203 | additional_data[item['label']].append({ 204 | 'info': item, 205 | 'ids': context_ids, 206 | 'label': item['label'], 207 | }) 208 | 209 | tf.logging.info("*** Done parsing files ***") 210 | 
print("LETS GO", flush=True) 211 | if FLAGS.max_training_examples > 0: 212 | 213 | examples_by_label = {'human': [], 'machine': []} 214 | for x in examples['train']: 215 | examples_by_label[x['label']].append(x) 216 | 217 | new_examples = [] 218 | print("Unique machine examples: {} -> {}".format(len(examples_by_label['machine']), 219 | FLAGS.max_training_examples), flush=True) 220 | machine_ex_to_keep = examples_by_label['machine'][:FLAGS.max_training_examples] 221 | 222 | # So we just cut down on the TRUE machine examples. now lets try adding in additional examples 223 | # examples_by_label['human'].extend(additional_data['human']) 224 | 225 | if len(additional_data['machine']) > 0: 226 | amount_to_add = len(examples_by_label['human']) - len(machine_ex_to_keep) 227 | if amount_to_add > 0: 228 | machine_ex_to_keep.extend(additional_data['machine'][:amount_to_add]) 229 | 230 | for i, human_ex in enumerate(examples_by_label['human']): 231 | new_examples.append(human_ex) 232 | new_examples.append(machine_ex_to_keep[i % len(machine_ex_to_keep)]) 233 | 234 | print("Length of examples: {} -> {}".format(len(examples['train']), len(new_examples)), flush=True) 235 | examples['train'] = new_examples 236 | 237 | # Training 238 | if FLAGS.do_train: 239 | num_train_steps = int((len(examples['train']) / FLAGS.batch_size) * FLAGS.num_train_epochs) 240 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 241 | assert num_train_steps > 0 242 | else: 243 | num_train_steps = None 244 | num_warmup_steps = None 245 | 246 | # Boilerplate 247 | tpu_cluster_resolver = None 248 | if FLAGS.use_tpu and FLAGS.tpu_name: 249 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 250 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 251 | 252 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 253 | run_config = tf.contrib.tpu.RunConfig( 254 | cluster=tpu_cluster_resolver, 255 | master=FLAGS.master, 256 | model_dir=FLAGS.output_dir, 257 | save_checkpoints_steps=FLAGS.iterations_per_loop, 258 | keep_checkpoint_max=None, 259 | tpu_config=tf.contrib.tpu.TPUConfig( 260 | iterations_per_loop=FLAGS.iterations_per_loop, 261 | num_shards=FLAGS.num_tpu_cores, 262 | per_host_input_for_training=is_per_host)) 263 | 264 | model_fn = classification_model_fn_builder( 265 | news_config, 266 | init_checkpoint=FLAGS.init_checkpoint, 267 | learning_rate=FLAGS.learning_rate, 268 | num_train_steps=num_train_steps, 269 | num_warmup_steps=num_warmup_steps, 270 | use_tpu=FLAGS.use_tpu, 271 | num_labels=len(LABEL_LIST), 272 | pool_token_id=encoder.begin_summary, 273 | adafactor=FLAGS.adafactor 274 | ) 275 | 276 | # If TPU is not available, this will fall back to normal Estimator on CPU 277 | # or GPU. 
278 | estimator = tf.contrib.tpu.TPUEstimator( 279 | use_tpu=FLAGS.use_tpu, 280 | model_fn=model_fn, 281 | config=run_config, 282 | train_batch_size=FLAGS.batch_size, 283 | eval_batch_size=FLAGS.batch_size, 284 | predict_batch_size=FLAGS.batch_size, 285 | params={'model_dir': FLAGS.output_dir} 286 | ) 287 | 288 | if FLAGS.do_train: 289 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 290 | 291 | tf.logging.info(f"***** Recreating training file at {train_file} *****") 292 | classification_convert_examples_to_features(examples['train'], batch_size=FLAGS.batch_size, 293 | max_seq_length=FLAGS.max_seq_length, 294 | encoder=encoder, output_file=train_file, 295 | labels=LABEL_LIST, 296 | chop_from_front_if_needed=False) 297 | tf.logging.info("***** Running training *****") 298 | tf.logging.info(" Num examples = %d", len(examples['train'])) 299 | tf.logging.info(" Num epochs = %d", FLAGS.num_train_epochs) 300 | tf.logging.info(" Batch size = %d", FLAGS.batch_size) 301 | tf.logging.info(" Num steps = %d", num_train_steps) 302 | 303 | train_input_fn = classification_input_fn_builder(input_file=train_file, seq_length=FLAGS.max_seq_length, 304 | is_training=True, drop_remainder=True, 305 | ) 306 | estimator.train(input_fn=train_input_fn, steps=num_train_steps) 307 | 308 | splits_to_predict = [x for x in ['val', 'test'] if getattr(FLAGS, f'predict_{x}')] 309 | for split in splits_to_predict: 310 | num_actual_examples = len(examples[split]) 311 | 312 | predict_file = os.path.join(FLAGS.output_dir, f'{split}.tf_record') 313 | tf.logging.info(f"***** Recreating {split} file {predict_file} *****") 314 | classification_convert_examples_to_features(examples[split], batch_size=FLAGS.batch_size, 315 | max_seq_length=FLAGS.max_seq_length, 316 | encoder=encoder, output_file=predict_file, 317 | labels=LABEL_LIST, pad_extra_examples=True, 318 | chop_from_front_if_needed=False) 319 | 320 | val_input_fn = classification_input_fn_builder(input_file=predict_file, seq_length=FLAGS.max_seq_length, 321 | is_training=False, drop_remainder=True, 322 | ) 323 | 324 | probs = np.zeros((num_actual_examples, 2), dtype=np.float32) 325 | for i, res in enumerate(estimator.predict(input_fn=val_input_fn, yield_single_examples=True)): 326 | if i < num_actual_examples: 327 | probs[i] = res['probs'] 328 | 329 | _save_np(os.path.join(FLAGS.output_dir, f'{split}-probs.npy'), probs) 330 | 331 | preds = np.argmax(probs, 1) 332 | labels = np.array([LABEL_INV_MAP[x['label']] for x in examples[split][:num_actual_examples]]) 333 | print('{} ACCURACY IS {:.3f}'.format(split, np.mean(labels == preds)), flush=True) 334 | 335 | 336 | if __name__ == "__main__": 337 | flags.mark_flag_as_required("input_data") 338 | flags.mark_flag_as_required("output_dir") 339 | tf.app.run() 340 | -------------------------------------------------------------------------------- /sample/encoder.py: -------------------------------------------------------------------------------- 1 | """Byte pair encoding utilities 2 | 3 | Some functions are adapted from OpenAI but with modifications 4 | 5 | https://github.com/openai/gpt-2 6 | """ 7 | 8 | import os 9 | import json 10 | import regex as re 11 | from functools import lru_cache 12 | import tensorflow as tf 13 | import random 14 | import numpy as np 15 | 16 | 17 | @lru_cache() 18 | def bytes_to_unicode(): 19 | """ 20 | Returns list of utf-8 byte and a corresponding list of unicode strings. 21 | The reversible bpe codes work on unicode strings. 
22 | This means you need a large # of unicode characters in your vocab if you want to avoid UNKs. 23 | When you're at something like a 10B token dataset you end up needing around 5K for decent coverage. 24 | This is a significant percentage of your normal, say, 32K bpe vocab. 25 | To avoid that, we want lookup tables between utf-8 bytes and unicode strings. 26 | This also avoids mapping to whitespace/control characters that the bpe code barfs on. 27 | """ 28 | bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1)) 29 | cs = bs[:] 30 | n = 0 31 | for b in range(2 ** 8): 32 | if b not in bs: 33 | bs.append(b) 34 | cs.append(2 ** 8 + n) 35 | n += 1 36 | cs = [chr(n) for n in cs] 37 | return dict(zip(bs, cs)) 38 | 39 | 40 | def get_pairs(word): 41 | """Return set of symbol pairs in a word. 42 | 43 | Word is represented as tuple of symbols (symbols being variable-length strings). 44 | """ 45 | pairs = set() 46 | prev_char = word[0] 47 | for char in word[1:]: 48 | pairs.add((prev_char, char)) 49 | prev_char = char 50 | return pairs 51 | 52 | 53 | class Encoder: 54 | def __init__(self, encoder, bpe_merges, errors='replace'): 55 | self.encoder = {k: v + 1 for k, v in encoder.items()} 56 | self.encoder['<|padding|>'] = 0 57 | self.padding = 0 58 | 59 | del self.encoder['<|endoftext|>'] 60 | 61 | for special_token_type in ['domain', 'date', 'authors', 'title', 'article', 'summary']: 62 | setattr(self, f'begin_{special_token_type}', len(self.encoder)) 63 | self.encoder[f'<|begin{special_token_type}|>'] = len(self.encoder) 64 | 65 | setattr(self, f'end_{special_token_type}', len(self.encoder)) 66 | self.encoder[f'<|endof{special_token_type}|>'] = len(self.encoder) 67 | 68 | # This will be used if we want to combine short articles. 
69 | self.reset_context = len(self.encoder) 70 | self.encoder['<|resetcontext|>'] = len(self.encoder) 71 | 72 | ################################## END OF SPECIAL TOKENS TO ADD 73 | 74 | self.decoder = {v: k for k, v in self.encoder.items()} 75 | self.errors = errors # how to handle errors in decoding 76 | self.byte_encoder = bytes_to_unicode() 77 | self.byte_decoder = {v: k for k, v in self.byte_encoder.items()} 78 | self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges)))) 79 | self.cache = {} 80 | 81 | # Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions 82 | self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""") 83 | 84 | def bpe(self, token): 85 | if token in self.cache: 86 | return self.cache[token] 87 | word = tuple(token) 88 | pairs = get_pairs(word) 89 | 90 | if not pairs: 91 | return token 92 | 93 | while True: 94 | bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf'))) 95 | if bigram not in self.bpe_ranks: 96 | break 97 | first, second = bigram 98 | new_word = [] 99 | i = 0 100 | while i < len(word): 101 | try: 102 | j = word.index(first, i) 103 | new_word.extend(word[i:j]) 104 | i = j 105 | except: 106 | new_word.extend(word[i:]) 107 | break 108 | 109 | if word[i] == first and i < len(word) - 1 and word[i + 1] == second: 110 | new_word.append(first + second) 111 | i += 2 112 | else: 113 | new_word.append(word[i]) 114 | i += 1 115 | new_word = tuple(new_word) 116 | word = new_word 117 | if len(word) == 1: 118 | break 119 | else: 120 | pairs = get_pairs(word) 121 | word = ' '.join(word) 122 | self.cache[token] = word 123 | return word 124 | 125 | def encode(self, text): 126 | bpe_tokens = [] 127 | for token in re.findall(self.pat, text): 128 | token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8')) 129 | bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' ')) 130 | return bpe_tokens 131 | 132 | def decode(self, tokens): 133 | text = ''.join([self.decoder[token] for token in tokens]) 134 | text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors) 135 | return text 136 | 137 | def __len__(self): 138 | return len(self.encoder) 139 | 140 | @property 141 | def special_tokens_onehot(self): 142 | """ Return the IDs of all special tokens""" 143 | return [(self.decoder[i].startswith('<|') and self.decoder[i].endswith('|>')) for i in range(len(self))] 144 | 145 | 146 | def get_encoder(): 147 | directory_name = os.path.dirname(__file__) 148 | with open(os.path.join(directory_name, 'encoder.json'), 'r') as f: 149 | encoder = json.load(f) 150 | with open(os.path.join(directory_name, 'vocab.bpe'), 'r', encoding="utf-8") as f: 151 | bpe_data = f.read() 152 | bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]] 153 | return Encoder( 154 | encoder=encoder, 155 | bpe_merges=bpe_merges, 156 | ) 157 | 158 | 159 | ############################################################## 160 | # TURN SOMETHING INTO THE RIGHT FORMAT FOR AN EXAMPLE 161 | ############################################################## 162 | def _tokenize_article_pieces(encoder, item): 163 | """ 164 | Turn the article into tokens 165 | NOTE: in hindsight I kinda messed up here because the first token is always represented as a BPE continuation 166 | rather than an initial token in its own right. whoops.... 
167 | 168 | :param item: Contains things that need to be tokenized 169 | 170 | 171 | fields are ['domain', 'date', 'authors', 'title', 'article', 'summary'] 172 | :return: dict 173 | """ 174 | article_pieces = { 175 | 'article': [encoder.begin_article] + encoder.encode(item['text']) + [encoder.end_article], 176 | 'domain': [encoder.begin_domain] + encoder.encode(item['domain']) + [encoder.end_domain], 177 | 'title': [encoder.begin_title] + encoder.encode(item['title']) + [encoder.end_title], 178 | } 179 | # 4/6: Attach the summary too, why the hell not 180 | if item['summary'] and len(item['summary']) > 50: 181 | article_pieces['summary'] = [encoder.begin_summary] + encoder.encode(item['summary']) + [encoder.end_summary] 182 | 183 | # 5/6: date 184 | date_split = item['publish_date'].split('-') 185 | assert len(date_split) == 3 186 | assert date_split[0].isdigit() 187 | 188 | date_txt = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 189 | 'August', 'September', 'October', 'November', 'December'][int(date_split[0]) - 1] + ' {}, {}'.format( 190 | date_split[1], date_split[2]) 191 | article_pieces['date'] = [encoder.begin_date] + encoder.encode(date_txt) + [encoder.end_date] 192 | 193 | # 6/6: authors 194 | authors = ', '.join(item['authors']) 195 | if len(authors) > 5: 196 | article_pieces['authors'] = [encoder.begin_authors] + encoder.encode(authors) + [encoder.end_authors] 197 | return article_pieces 198 | 199 | 200 | def _cut_tokens_to_add_stuff(tokens, stuff_to_add, desired_size, padding_token): 201 | """ 202 | The idea behind this function is to take away tokens from `tokens' such that tokens[:LENGTH] + stuff_to_add becomes 203 | exactly at the right size (desired_size). 204 | 205 | :param tokens: 206 | :param stuff_to_add: 207 | :param desired_size: 208 | :return: 209 | """ 210 | if len(tokens) >= desired_size: 211 | return tokens 212 | 213 | # no way we can add this stuff 214 | if len(stuff_to_add) >= desired_size: 215 | return tokens 216 | 217 | if (len(tokens) + len(stuff_to_add)) <= desired_size: 218 | return tokens + stuff_to_add 219 | 220 | # Otherwise we'll have to actually cut 221 | tokens = tokens[:(desired_size - len(stuff_to_add) - 1)] 222 | tokens.append(padding_token) 223 | tokens.extend(stuff_to_add) 224 | return tokens 225 | 226 | 227 | def tokenize_for_grover_training(encoder, item, desired_size=1024, unconditional_prob=0.35, metadata_dropout_prob=0.1, 228 | cut_prob=0.2): 229 | """ 230 | Not only will we tokenize an item with a BPE encoder, but we'll also put it in a nice format for language modeling. 231 | The goal is to MINIMIZE PADDING. If we don't fill up the desired size of 1024 tokens then we're wasting compute. 232 | 233 | The canonical order is 234 | 235 | DOMAIN DATE AUTHORS TITLE ARTICLE SUMMARY 236 | 237 | 238 | :param encoder: 239 | :param item: Contains things like 240 | {"url": "https://www.advocate.com/node/1010911", 241 | "timestamp": "20180118211607", 242 | "url_used": "https://web.archive.org/web/20180118211607id_/https://www.advocate.com/node/1010911", 243 | "domain": "advocate.com", 244 | "title": "Report: One-Third of Trump's Judicial Picks Are Anti-LGBT", 245 | "text": .... 246 | "summary": .... 247 | "authors": list 248 | "publish_date": ... 249 | } 250 | :param desired_size: the goal for how long the span will be 251 | :param unconditional_prob: The probability that we will generate JUST THE TEXT first. 
252 | :param metadata_dropout_prob: The probability that we will drop out each item of metadata 253 | :param cut_prob: The probability that, if we're already over the desired size, we'll cut the article and start 254 | predicting metadata before the desired_size window ends. 255 | :return: 256 | """ 257 | # Get all the bits and pieces 258 | article_pieces = _tokenize_article_pieces(encoder, item) 259 | canonical_metadata_order = ['domain', 'date', 'authors', 'title'] 260 | 261 | # unconditional_prob is probability we only generate the text first, without any metadata 262 | switch = random.random() 263 | if switch < unconditional_prob: 264 | assignments = {'article': 'a'} 265 | chunk_a = article_pieces.pop('article') 266 | chunk_b = [] 267 | for x in canonical_metadata_order + ['summary']: 268 | if random.random() > metadata_dropout_prob: 269 | chunk_b.extend(article_pieces.pop(x, [])) 270 | assignments[x] = 'b' 271 | elif switch < 0.5: 272 | # Put everything in chunk_a, without dropout 273 | assignments = {} 274 | chunk_a = [] 275 | chunk_b = [] 276 | for x in canonical_metadata_order + ['article', 'summary']: 277 | chunk_a.extend(article_pieces.pop(x, [])) 278 | assignments[x] = 'a' 279 | else: 280 | assignments = {} 281 | chunk_a = [] 282 | chunk_b = [] 283 | for k in canonical_metadata_order + ['article', 'summary']: 284 | if random.random() < metadata_dropout_prob and k not in ('article', 'title'): 285 | pass 286 | elif random.random() < 0.5: 287 | if k != 'summary': 288 | chunk_a.extend(article_pieces.pop(k, [])) 289 | assignments[k] = 'a' 290 | else: 291 | chunk_b.extend(article_pieces.pop(k, [])) 292 | assignments[k] = 'b' 293 | 294 | if (len(chunk_a) + len(chunk_b)) <= desired_size: 295 | return chunk_a + chunk_b 296 | 297 | if (assignments.get('article', '') == 'a') and (len(chunk_b) > 0) and (random.random() < cut_prob): 298 | return _cut_tokens_to_add_stuff(chunk_a, chunk_b, desired_size, encoder.padding) 299 | 300 | tokens = chunk_a + chunk_b 301 | return tokens 302 | 303 | 304 | def detokenize(encoder, tokens): 305 | return encoder.decode(tokens) 306 | 307 | 308 | ####################################### 309 | 310 | def create_int_feature(values): 311 | feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 312 | return feature 313 | 314 | 315 | def sliding_window(article, max_seq_length, pad_token): 316 | """ 317 | Randomly sample some spans. 
It's a simple approximation of sliding window 318 | :param tokens: 319 | :param max_seq_length: 320 | :return: 321 | """ 322 | # if it's shorter, no need for this 323 | if len(article['input_ids']) <= max_seq_length: 324 | amount_to_pad = max_seq_length - len(article['input_ids']) 325 | article['input_ids'].extend([pad_token] * amount_to_pad) 326 | yield article 327 | return 328 | 329 | num_spans = len(article['input_ids']) - max_seq_length + 1 330 | weights = np.ones(num_spans, dtype=np.float32) 331 | # weights[0] = max_seq_length 332 | weights /= weights.sum() 333 | 334 | num_to_yield = int(0.5 + len(article['input_ids']) / max_seq_length) 335 | starts = np.random.choice(num_spans, size=num_to_yield, replace=False, p=weights) 336 | 337 | input_ids = article.pop('input_ids') 338 | for i in starts.tolist(): 339 | article['input_ids'] = input_ids[i:(i + max_seq_length)] 340 | yield article 341 | 342 | 343 | def format_context(encoder, news_article, target): 344 | """ 345 | Generates a news article given some partial information 346 | :param news_article: Contains context 347 | :param target: What we want to get an answer for. 348 | :return: 349 | """ 350 | canonical_metadata_order = ['domain', 'date', 'authors', 'title', 'article'] 351 | tokens = [] 352 | for metadata_category in canonical_metadata_order: 353 | metadata = news_article.get(metadata_category, '').strip() 354 | 355 | # This MIGHT BE needed because I think during training time we never saw empty articles 356 | # if metadata or ((metadata_category == 'article') and target != 'article'): 357 | if (metadata_category == 'article') and (target != 'article'): 358 | metadata = news_article.get('title', '') # Just copy from the title maybe? 359 | 360 | if metadata: 361 | tokens.append(encoder.__dict__[f'begin_{metadata_category}']) 362 | tokens.extend(encoder.encode(metadata)) 363 | tokens.append(encoder.__dict__[f'end_{metadata_category}']) 364 | 365 | assert target in (canonical_metadata_order + ['summary']) 366 | tokens.append(encoder.__dict__[f'begin_{target}']) 367 | return tokens 368 | 369 | 370 | def extract_generated_target(output_tokens, encoder, target): 371 | """ 372 | Given some tokens that were generated, extract the target 373 | :param output_tokens: [num_tokens] thing that was generated 374 | :param encoder: how they were encoded 375 | :param target: the piece of metadata we wanted to generate! 
376 | :return: 377 | """ 378 | # Filter out first instance of start token 379 | assert output_tokens.ndim == 1 380 | 381 | start_tokens = output_tokens == encoder.__dict__[f'begin_{target}'] 382 | if np.any(start_tokens): 383 | start_ind = np.argmax(start_tokens) + 1 384 | else: 385 | start_ind = 0 386 | 387 | end_tokens = output_tokens == encoder.__dict__[f'end_{target}'] 388 | if np.any(end_tokens): 389 | end_ind = np.argmax(end_tokens) 390 | else: 391 | end_ind = output_tokens.shape[0] 392 | 393 | return { 394 | 'extraction': encoder.decode(output_tokens[start_ind:end_ind]), 395 | 'start_ind': start_ind, 396 | 'end_ind': end_ind, 397 | } 398 | 399 | 400 | if __name__ == '__main__': 401 | encoder = get_encoder() 402 | print("VOCAB SIZE IS {}".format(len(encoder.encoder))) 403 | -------------------------------------------------------------------------------- /realnews/realnews_tiny.jsonl: -------------------------------------------------------------------------------- 1 | {"title":"Reynolds High School tragedy: more laws are not the answer","text":"by In the news\nby Dan Lucas\nOn Tuesday morning a 15-year-old killer murdered fellow student Emilio Hoffman, age 14, and wounded a teacher at Reynolds High School in Troutdale, Oregon. The killer brought an AR-15 type rifle, a semi-automatic pistol and enough ammunition to do far more harm \u2013 but he was stopped before he could by police. The police were aided by a brave teacher and by a well-practiced school lockdown. After being confronted and exchanging gunfire with police, the killer committed suicide. Police also found the killer had brought a large knife.\nWithin hours, politicians were already attempting to exploit the tragedy \u2013 including Oregon\u2019s U.S. Rep. Earl Blumenauer and U.S. Rep. Suzanne Bonamici, who were calling for expanded background checks.\nMore laws are not the answer\nFurther expanding background checks in Oregon \u2013 already one of the most restrictive states for background checks \u2013 would not have stopped the 15-year-old killer. It\u2019s already illegal for anyone under 18 to purchase a firearm in Oregon, and this killer didn\u2019t buy the guns \u2013 he took them \u2013 by defeating his family\u2019s security measures where the guns were secured.\nAdditionally, the 15-year-old killer at Reynolds brought firearms and a large knife into a high school despite school policies banning weapons, Oregon state law prohibiting bringing or having a weapon at school, and the federal Gun-Free School Zones Act.\nThe killer was also in violation of ORS 166.250 (unlawful possession of firearms) even before he got to the school because he was under 18, and he was in violation of the Multnomah County gun control ordinance passed last year.\nThe 15-year-old killer violated many, many existing weapons-related laws \u2013 and none of those laws stopped him.\nWhat did work?\nReynolds High School has two full-time officers on campus every day \u2013 and they were on the scene in less than a minute \u2013 and they are being credited with stopping the killer before he could inflict more carnage.\nAfter Sandy Hook, the NRA\u2019s Wayne LaPierre advocated for a police officer in every school. He said \u201cThe only thing that stops a bad guy with a gun is a good guy with a gun.\u201d The media, who widely ridiculed and mocked him at the time, is largely ignoring this important part of the story. 
It\u2019s important because it actually worked to save lives at Reynolds.\nWhat also worked was the bravery of a Reynolds teacher, who sounded the alert that initiated the school lockdown even though he had been wounded. A lockdown that had recently been practiced. KOIN reported \u201cPolice said the response from school staff, police and students prevented \u2018many, many\u2019 deaths, given the amount of ammunition [the killer] carried. Troutdale Mayor Doug Daoust said Reynolds High had participated in an active shooter drill in the weeks leading up to June 10.\u201d\nWhat more can be done?\nThere needs to be a continued focus on mental health, including increasing funding where it\u2019s needed. More schools should have armed law enforcement present. The media should stop giving publicity to the killers. There needs to be more funding for prosecutors, police and jails so that existing laws can be enforced.\nAdditionally, there needs to be public pressure on the media to stop perpetuating false and highly emotional data and perceptions. For example, all the irresponsible news agencies who reported there have been 74 school shootings since Sandy Hook in Newtown, CT. The 74 number came from Michael Bloomberg\u2019s anti-gun group Everytown for Gun Safety. CNN has since taken the time to investigate the number and found the actual number to be 15.\nIt is indicative of how gun control groups are willing to stretch the truth to achieve their agenda. When they do that, and when the media parrots it, it makes rational discussion much more difficult.\nPlease let\u2019s move off of fixed political agendas and focus on what works to keep our children and everyone safer.\nUPDATE (11\/12\/2014): The Oregonian reported that the guns used by Jared Padgett had been locked up: \u201cInvestigators determined that the AR-15 rifle that Padgett used in the school shooting had been locked in a Pelican-brand gun case in the bedroom he shared with his older brother, Lucas Padgett. A .25-caliber handgun found on Jared Padgett after he had killed himself with the rifle the day of the June 10 shooting had been locked in his father Michael Padgett\u2019s bedroom, investigators concluded.\u201d\nTo read more from Dan, visit www.dan-lucas.com\n","summary":null,"authors":["In The News"],"publish_date":"06-13-2014","status":"success","url":"http:\/\/oregoncatalyst.com\/27758-reynolds-high-school-tragedy-laws-answer.html","domain":"oregoncatalyst.com","warc_date":"2016-12-11T13:48:31Z","split":"train"} 2 | {"title":"Researchers eye vision-correcting displays for devices","text":"New electronic display technology can automatically correct for vision defects without glasses or contact lenses, according to researchers at the MIT Media Lab and University of California at Berkeley.\nThe technology has applications for everything from e-readers and tablets to smartphones and GPS displays. By building vision correction into electronic displays researchers are hoping to improve conditions in emerging markets where glasses and prescription lenses don't come easy.\nGordon Wetzstein, a researcher at MIT's Media Lab, penned a research paper describing the technology. The paper will be presented at Siggraph, a graphics conference, later this month. Ramesh Raskar, director of the Media Lab's Camera Culture group, and Berkeley\u2019s Fu-Chung Huang and Brian Barsky are also listed on the paper.\nThe vision correction screens are a spin on the glasses-free 3D technology also developed at MIT. 
MIT's 3D screen projects different images to the left and right eye. The vision-correction version slightly different images to parts of the viewer's pupil.\nVision-correcting displays use software and algorithms to simulate an image at the proper focal distance to be seen correctly. Given advances in screen technology such as the Amazon Fire's glasses free 3D effects--- also known as dynamic perspective ---some variation of vision correction may be possible commercially in the future.\nKey points include:\n","summary":"Researchers at the MIT Media Lab and University of California at Berkeley say the technology could improve conditions in emerging markets where glasses and prescription lenses don't come easy.","authors":["Larry Dignan"],"publish_date":"07-31-2014","status":"success","url":"http:\/\/www.zdnet.com\/article\/researchers-eye-vision-correcting-displays-for-devices\/","domain":"zdnet.com","warc_date":"2016-12-11T14:27:02Z","split":"train"} 3 | {"title":"'American Idol' Recap: Burnell Taylor Exits, Power Rankings","text":"Not much to report here, just another results show on \"American Idol,\" meaning another guy has been sent home from the big stage. Four-in-a-row with no sign of letting up.\nOf course, there is that one slight sign: Lazaro somehow sneaking into the top three. Is this backlash against a show so obviously rigged from the start for a girl to win? Is Vote for the Worst's final season really having an impact and could they finally reach their ultimate goal? Do people actually like Lazaro?\nNah, after he nearly went home last week, it seems like more of an aberration after the first performance of his in weeks the judges didn't lambaste. Look for him to hit leadoff next week for a quick and easy exit and the 'Idol' producers can have their all-female final five and say, \"See, we're not just a show where a white guy with a guitar wins every season.\"\nYes, they had to stack the deck and make it the theme of the season to accomplish that, but at least they're going to get done what they wanted.\nAnd with that, here are some quick power rankings:\n1 (Rank Last Week: 1) - Angie Miller\nStill the presumptive winner and the judges' clear favorite. She's never been in any danger and actually comes closes to cute guy with guitar of anyone this season.\n2 (2) - Janelle Arthur\nShowed some inconsistency this week, but she still seems like the country girl best bet this year.\n3 (4) - Kree Harrison\nAnother top three for Kree and a clean sweep of the judges' ballots. Somehow she's catching on, but she's gotta wear thin at some point.\n4 (3) - Candice Glover\nIt's been shown dozens of times: The best pure singer never wins. That's Candice this year.\n5 (5) - Amber Holcomb\nOr maybe it's Amber, either way...\n","summary":null,"authors":["Andrew Payne","Noelle Talmon","Starpulse Staff"],"publish_date":"04-05-2013","status":"success","url":"http:\/\/www.starpulse.com\/american-idol-recap-burnell-taylor-exits-power-rankings-1848339569.html","domain":"starpulse.com","warc_date":"2016-12-11T14:41:58Z","split":"train"} 4 | {"title":"Social Democrats trail in polls despite anniversary","text":"Photo: DPA\nCentral Berlin will be turned into a huge celebration of the Social Democratic Party (SPD) this weekend, with the famous Strasse des 17. 
Juni closed to traffic and crammed with beer and sausage stands and live music to mark the party's 150th anniversary.\nComing fewer than 40 days before the general election, yet billed as a \"Germany festival\", the party has drawn criticism from political opponents who say it is nothing other than a campaign vehicle.\nThe district mayor, a member of Chancellor Angela Merkel's conservatives (CDU) had opposed the application for the party, but he was bypassed by SPD city bosses. Carsten Spallek argued that the party had already marked its anniversary in May in Leipzig, where it was founded.\nIt was there on May 23rd, 1863 that the General German Workers' Association or ADAV - the predecessor to the SPD - was set up in Leipzig by the pioneer of German socialism, Ferdinand Lassalle.\nBut Berlin's city transport minister Christian Gaebler, an SPD member, said: \"The SPD can, as Germany's oldest political party, claim the right to hold its summer jubilee festival on [Berlin's Strasse des] 17. Juni.\"\nIf it were to function as a political rally - and chancellor candidate Peer Steinbr\u00fcck is set to make an appearance - it would need to work very well to have much of an effect - the SPD is still trailing Merkel's CDU significantly.\nFigures published on Friday by public broadcaster ZDF suggested the SPD was on 25 percent of the vote, while the CDU's conservative bloc was on 41 percent.\nIt is only the continued poor performance of the CDU's current government coalition partner which is hindering a certain continuation of the pairing - the Free Democrats, FDP, are on the five-percent line, the minimum needed to send MPs to parliament.\nThis means the current coalition is on a combined 46 percent, while the SPD with its preferred coalition partner the Greens, would achieve 38 percent. On those figures neither grouping would have enough to form a majority government.\nBut on personal popularity, Merkel remains streets ahead of Steinbr\u00fcck, with 63 percent saying they would vote for her if the chancellor were to be directly elected. Steinbr\u00fcck, whose election campaign has failed to catch light, was the favourite of just 29 percent.\nAnd this year election polls could continue much longer than is normally the practice, after ZDF said it would break the convention of publishing the last one ten days before the election, before observing a poll black-out until the vote.\nThis year ZDF plans to publish the results of a poll three days before the election, saying people were deciding which way to vote later and later, and that the broadcaster did not want to keep its information from the public.\nStory continues below\u2026\nThis is likely to benefit the FDP more than any other party, academic Peter L\u00f6sche told the Tagesspiegel newspaper. He said people were only likely to be influenced by late opinion polls if they suggested a party was heading to fail to clear the five percent needed to get into parliament.\n\"There is nothing better that could happen for the FDP than lying at 4.5 percent in the ZDF opinion poll on the Thursday before the election,\" he told the paper.\nThe Local\/DPA\/hc\n","summary":"Central Berlin will be turned into a huge celebration of the Social Democratic Party (SPD) this weekend, with the famous Strasse des 17. 
Juni closed to traffic and crammed with beer and sausage stands and live music to mark the party's 150th anniversary.","authors":["The Local"],"publish_date":"08-16-2013","status":"success","url":"http:\/\/www.thelocal.de\/20130816\/51443","domain":"thelocal.de","warc_date":"2016-12-11T13:45:20Z","split":"train"} 5 | {"title":"U.S. Approves First Factory in Cuba Since Revolution","text":"HAVANA, Cuba \u2014 The Obama administration has approved the first U.S. factory in Cuba in more than 50 years, allowing a two-man company from Alabama to build a plant assembling as many as 1,000 small tractors a year for sale to private farmers in Cuba.\nThe Treasury Department last week notified partners Horace Clemmons and Saul Berenthal that they can legally build tractors and other heavy equipment in a special economic zone started by the Cuban government to attract foreign investment.\nCuban officials already have publicly and enthusiastically endorsed the project. The partners said they expect to be building tractors in Cuba by the first quarter of 2017.\n\"It's our belief that in the long run we both win if we do things that are beneficial to both countries,\" said Clemmons.\nPlay Facebook\nTwitter\nGoogle Plus\nEmbed Cuba Seeing More Tech Entrepreneurs 0:39 autoplay autoplay Copy this code to your website or blog\nThe $5 million to $10 million plant would be the first significant U.S. business investment on Cuban soil since Fidel Castro took power in 1959 and nationalized billions of dollars of U.S. corporate and private property. That confiscation provoked a U.S. embargo on Cuba that prohibited virtually all forms of commerce and fined non-U.S. companies millions of dollars for doing business with the island.\nLetting an American tractor company operate inside a Cuban government facility would have been unimaginable before Presidents Barack Obama and Raul Castro declared on Dec. 17, 2014, that they would restore diplomatic relations and move to normalize trade, travel and other aspects of the long-broken bilateral relationship.\nA view of the skyline in Havana, Cuba, August 2015. Sandra Lilley, NBC News\nSince then, Obama has been carving exceptions into the embargo through a series of executive actions, and his administration now says they allow U.S. manufacturing at the Mariel port and special economic zone about 30 miles west of Havana. One exception allows U.S. companies to export products that benefit private and cooperative farmers in Cuba. Berenthal and Clemmons say they will sell only to the private sector.\nThe Oggun tractor plant, named after a god in Cuba's syncretic Santeria religion, will assemble commercially available components into a durable and easy-to-maintain 25-horsepower tractor selling for less than $10,000, Clemmons and Berenthal said. The men believe they can sell hundreds of the tractors a year to Cuban farmers with financing from relatives outside the country and to non-government organizations seeking to help improve Cuban agriculture, which suffers from low productivity due mostly to excessive control of both basic supplies and prices by an inefficient, centrally planned state bureaucracy.\nPlay Facebook\nTwitter\nGoogle Plus\nEmbed U.S. 
Commerce Secretary Penny Pritzker In Cuba 0:21 autoplay autoplay Copy this code to your website or blog\n\"I have two countries that for 60 years have been in the worst of terms, anything I can do to bring to the two countries and the two people together is tremendously satisfying,\" said Berenthal, a Cuban-born semi-retired software engineer who left the country at age 16.\nBerenthal said they are optimistic that they will also be able to export Oggun tractors to other Latin American countries, which have low or no tariffs on Cuba products, making them competitive on price. The men expect a 10-20 percent profit on each tractor.\nFor the project's first three years, Clemmons and Berenthal say they will export components from the United States for assembly in Cuba. They hope to eventually begin manufacturing many of the parts themselves on the island. They said they expect to start with 30 Cuban employees and, if things go as planned, grow within five years to as many as 300.\nClemmons and Berenthal will publish all the schematics of their tractors online in order to allow Cubans and other clients to more easily repair their equipment and come up with designs for other heavy equipment based on the same frame and motor that Cleber can then produce at their Mariel factory.\nThe men already have plans to produce excavators, backhoes, trench-diggers and forklifts, equipment that's badly needed across Cuba, where virtually all the infrastructure is crumbling after years of neglect and mismanagement and a lack of cash that the government blames on the embargo.\n\"I think it'll have a tremendous impact on their ability not only to help their economy but to set an example across the Caribbean and Latin America,\" Berenthal said.\nFollow NBC News Latino on Facebook, Twitter and Instagram.\n","summary":"The U.S. has approved the first U.S. factory in Cuba in over 50 years, a two-man company from Alabama that will assemble tractors.","authors":["Associated Press"],"publish_date":"02-15-2016","status":"success","url":"http:\/\/www.nbcnews.com\/news\/latino\/u-s-approves-first-factory-cuba-revolution-n518926","domain":"nbcnews.com","warc_date":"2016-12-11T14:54:38Z","split":"train"} 6 | -------------------------------------------------------------------------------- /sample/april2019_set_mini.jsonl: -------------------------------------------------------------------------------- 1 | {"url":"https:\/\/techcrunch.com\/2019\/04\/19\/malwaretech-legal-case-over\/","url_used":"https:\/\/web.archive.org\/web\/20190424064330id_\/https:\/\/techcrunch.com\/2019\/04\/19\/malwaretech-legal-case-over\/","title":"Malware researcher Marcus Hutchins pleads guilty, ending his legal case \u2013 TechCrunch","text":"Malware researcher Marcus Hutchins has pleaded guilty to two counts of creating and selling a powerful banking malware, ending a long and protracted battle with U.S. prosecutors.\nHutchins, a British national who goes by the online handle MalwareTech, was arrested in August 2017 as he was due to fly back to the U.K. following the Def Con security conference in Las Vegas. Prosecutors charged Hutchins with his involvement with creating the Kronos banking malware, dating back to 2014. He was later freed on bail.\nA plea agreement was filed with the Eastern District of Wisconsin, where the case was being heard on Friday. His trial was set to begin later this year.\nHutchins agreed to plead guilty to distributing Kronos, a trojan that can be used to steal passwords and credentials from banking websites. 
In recent years, the trojan has continued to spread. He also agreed to plead guilty to a second count of conspiracy.\nHutchins faces up to 10 years in prison. Prosecutors have dropped the remaining charges.\nIn a brief statement on his website, Hutchins said: \u201cI regret these actions and accept full responsibility for my mistakes.\u201d\n\u201cHaving grown up, I\u2019ve since been using the same skills that I misused several years ago for constructive purposes,\u201d he said. \u201cI will continue to devote my time to keeping people safe from malware attacks.\u201d\nHis attorney Marcia Hofmann did not immediately return a request for comment.\nHutchins rose to prominence after he stopped the spread of the WannaCry ransomware attack in May 2017, months before his arrest. The attack used powerful hacking tools developed by the National Security Agency, which were later leaked, to backdoor thousands of Windows computers and install ransomware. The attack was later attributed to hackers backed by North Korea, knocking U.K. hospitals offline and crippling major companies around the world.\nBy registering a domain name found in the malware\u2019s code, Hutchins stemmed the spread of the infection. He was hailed a hero for stopping the attack.\nPrior to his release and after, Hutchins gained further praise and respect from the security community for his contributions to the malware-reversing field, and demonstrating his findings so others can learn from his findings.\nJustice Department spokesperson Nicole Navas declined to comment.\n","summary":null,"authors":[],"publish_date":"04-19-2019","domain":"techcrunch.com","warc_date":"20190424064330","status":"success_wayback","split":"gen","inst_index":24030} 2 | {"title":"Premier Horgan pledges province's capital plan to help boost value-added timber manufacturing","text":"Premier John Horgan promised that the province will build a major modernization of the Royal B.C. Museum using mass-timber components as a first step to kickstart demand for made-in-B.C., value added products.\nHorgan, speaking at the Council of Forest Industries convention in Vancouver on Friday, offered the measure as an incentive for the industry\u2019s co-operation in a process to revitalize the province\u2019s Interior forest sector, which is grappling with the challenge of shrinking timber supplies due to wildfires and the mountain pine beetle infestation.\nAnd it won\u2019t just be the museum revitalization. Horgan said a new $1.9-billion St. Paul\u2019s Hospital will use timber wherever it can. As for his government\u2019s $20-billion capital plan, \u201cto the greatest extent possible, mass timber will be the foundation of that construction.\u201d\n\u201cAnnouncing that the Royal B.C. Museum upgrade will be done with mass timber is transformative for the institution and also for the sector,\u201d Horgan said. \u201cSimilarly, St. 
Paul\u2019s and other projects will create a domestic market, and from there, we can start to market to other jurisdictions.\u201d\nHorgan talked about the concept as an election promise in 2017, but held it out Friday as an opportunity to an audience of some 650 industry representatives as he sought buy-in for a process to speed along a transformation of the industry from high-volume lumber production to high-value manufacturing.\nHowever, the industry is in the middle of shedding jobs by closing mills or reducing shifts to cope with those shrinking timber supplies at the same time as Horgan\u2019s government is trying to create jobs in forestry.\n\u201cIf we don\u2019t have a transformation away from the high-volume to the high-value economy, we\u2019re going to be struggling,\u201d Horgan said. \u201cAnd this is not a surprise to anyone, nor did it just arrive on my watch. But as we deal with that downturn, we need to also deal with the approach.\n\u201cAnd that\u2019s where I\u2019m asking the industry to be innovative,\u201d Horgan said, calling for companies, First Nations, community leaders and unions to take part in consultation talks. The process won\u2019t replace existing consultations with First Nations or timber-allocation discussion, Horgan said. The idea is for people in B.C.\u2019s Interior to come up with a common vision for revitalizing forestry.\n\u201cIf I announce, as often happens at these COFI conventions, \u2018This is the government\u2019s view,\u2019 there would be 100 people calling in to question the value of those incentives or restrictions,\u201d Horgan said. \u201cIf I ask the people that are dependent on the industry to sit down and come up with a common position, we\u2019re going to get better outcomes.\u201d\nHorgan sent a letter, dated April 1, to industry executives, First Nations, community and labour leaders looking for their participation in the engagement process. Instead of a top-down prescription, Horgan said government is looking for, and will support, development of an \u201cinclusive, implementable TSA-level vision for industry competitiveness and community economic stability.\u201d\nThe industry welcomed Horgan\u2019s pledge to use mass timber in public projects to help create a bigger domestic market in B.C.\u2019s export-dependent forestry sector.\n\u201cWhat the premier talked about today, what we\u2019ve talked a lot about, is the best way to approach helping our industry is to grow demand,\u201d COFI CEO Susan Yurkovich said.\nHowever, Yurkovich said industry representatives are still figuring out Horgan\u2019s offer to take part in the engagement process for revitalizing the Interior\u2019s forest industry.\n\u201cI understand there is a desire to have a process to address challenges in the Interior and do that collectively, and of course I\u2019m supportive of that,\u201d Yurkovich said. \u201c(But) I don\u2019t fully understand the process.\u201d\nJohn Rustad, the MLA for Nechako Lakes and the B.C. Liberal party\u2019s forestry critic, said industry representatives are telling him that government is creating uncertainty that has them holding back on investments in what has become a high-cost place to do business.\n\u201cGovernment needs to start looking at the challenges that industry has and have its initiatives in that context,\u201d said Rustad, who was in attendance at the convention. 
\u201cHow do we make the industry more competitive?\u201d\ndepenner@postmedia.com\ntwitter.com\/derrickpenner\nCLICK HERE to report a typo.\nIs there more to this story? We\u2019d like to hear from you about this or any other stories you think we should know about. Email vantips@postmedia.com\n","summary":"B.C. will use mass timber to the greatest extent possible in building public housing, schools and hospitals within its $20-billion capital plan, Horgan promised.","authors":["Updated"],"publish_date":"04-06-2019","status":"success","url":"https:\/\/vancouversun.com\/news\/local-news\/premier-horgan-pledges-provinces-capital-plan-to-help-boost-value-added-timber-manufacturing","domain":"vancouversun.com","warc_date":"2019-04-21T17:21:03Z","split":"gen","inst_index":76452} 3 | {"url":"https:\/\/www.chicagotribune.com\/lifestyles\/askamy\/ct-ask-amy-04212019-story%2Camp.html","url_used":"https:\/\/web.archive.org\/web\/20190423032533id_\/https:\/\/www.chicagotribune.com\/lifestyles\/askamy\/ct-ask-amy-04212019-story,amp.html","title":"Ask Amy: Uncle needs to do some uncle-ing","text":"Dear Amy: My parents spend winters in warmer climates. Their return home will coincide with a family get-together at their house. In attendance will be my 20-something niece and her boyfriend \u2014 whose behavior needs improvement. He\u2019s a nice guy otherwise, but he is evidently unaware of how to carry himself thoughtfully.\nAt my parents\u2019 60th wedding anniversary party, for example, he and my niece occupied the only prime space, directly across from my parents, using my parents as a backdrop for their make-out session.\nI would have preferred those two seats be occupied by my brother and me, so we could be physically close to my parents during this celebration of their marriage. At Christmas at my parents\u2019 house, my niece\u2019s boyfriend occupied\/reserved the front of the buffet line while everyone else helped to prepare it. He stood there (literally) wiping his dripping nose with his fingers and then transferring those drippings to the serving utensil he picked up immediately afterward.\nI\u2019m afraid if I say anything (praise in public, coach in private) my youngest brother will hear, go ballistic and temporarily avoid family functions, which would devastate my 80-year-old mother. Can anything be done?\n\u2014 Upset Uncle\nDear Uncle: It is the aunt\/uncle\u2019s time-honored prerogative to offer gentle suggestions to clueless young-adult nieces\/nephews. This is not parenting. This is uncle-ing.\nAnd so, if the couple is sitting where you believe you and your brother should be sitting, you say, \u201cHi guys, would you mind moving over two chairs so my brother and I can sit next to our parents?\u201d\nREAD MORE: You're ready for your adult child and his kid to move out. How do you handle? \u00bb\nIn terms of the buffet hoarding (a pet peeve of mine), in our large family we have dealt with this by one or more elders leading a blessing before the serving, acknowledging and publicly thanking the people who prepared the food, and then stating: \u201cLet\u2019s let the older people go through the line first, so they can get themselves situated. Then the rest of us can go through.\u201d\nI can\u2019t speak to your younger brother\u2019s choice to go ballistic. You are not offering judgments here; you are merely demonstrating some leadership.\nDear Amy: The best grocery in my town has great made-to-order sandwiches at an excellent price. 
I have been effusive in my praise and thanks to the woman at the counter who is always there in the early morning when I typically place my order.\nOver time, the attitude of the person making the sandwiches has changed visibly, where it is clear that she is unhappy to see me and to make the sandwich. Once I realized that my order interrupts her morning prep work, I have now minimized those requests. On several occasions, I\u2019ve even apologized to her for placing the order.\nBut recently I ordered a sandwich later in the day and got the same unfriendly response. And now I have begun to wonder whether this is her problem, or mine.\nThere is no restriction posted on when a sandwich can be ordered. So what do I do? Say something to her? Say something to the manager? Stop ordering a sandwich that is healthy and tasty because it makes the employee unhappy?\nI don\u2019t want to do anything to hurt this person at her job, as I see she works very hard and I am sympathetic to her. But there is something upside down about this and I don\u2019t know what to do.\n\u2014 Sandwich Guy\nDear Sandwich Guy: This is not your problem. So far, the only problem I see is that you are apologizing for patronizing a local business, and cheerfully and respectfully asking someone to do her job.\nIf you order this same thing every day (it sounds as if you do), then the person working there should anticipate this. If you are courteous (it sounds as if you are), then the person working there should respond in kind.\nI wonder if you have seen her comportment toward other customers. Is she grouchy toward everyone? Have you also seen the legendary \u201cSoup Nazi\u201d episode of \u201cSeinfeld\u201d? Viewing this might help put this episode into perspective.\nLikely her demeanor has nothing to do with you. Order and enjoy.\nDear Amy: I was disgusted by the question from \u201cUpset,\u201d whose husband insisted on texting while driving.\nI wish you had suggested to her that she might enjoy visiting her husband in jail after he causes a tragic accident.\n\u2014 Disgusted\nDear Disgusted: The high volume of responses to this question demonstrates how worried people are about sharing the road with distracted drivers.\n(You can contact Amy Dickinson via email: askamy@amydickinson.com. Readers may send postal mail to Ask Amy, P.O. Box 194, Freeville, NY 13068. You can also follow her on Twitter @askingamy or \u201clike\u201d her on Facebook.)\nCopyright 2019 by Amy Dickinson\nDistributed by Tribune Content Agency\nYour new partner asks for a loan. Here's how to say 'no' without ruining the romance \u00bb\nShould significant others be excluded from the family photo? \u00bb\n","summary":"Uncle is worried about niece's boyfriend's behavior.","authors":["Amy Dickinson"],"publish_date":"04-21-2019","domain":"chicagotribune.com","warc_date":"20190423032533","status":"success_wayback","split":"gen","inst_index":10011} 4 | {"title":"Fox and Friends Segment Suggests Climate Change is the Result of Friction","text":"What a magical, inspiring time for right-wing thought. 
On Friday morning, Fox & Friends welcomed \u201csocial media superstars and Fox Nation contributors\u201d Diamond and Silk to discuss climate change and immigration, and if you\u2019d rather just skip ahead to the part where you bang your head against your desk, I won\u2019t blame you.\nIn case you don\u2019t feel like losing five minutes of your one precious life to something that will make you noticeably more stupid, here\u2019s the gist: Diamond and Silk got airtime on a national cable news network to make the claim that climate change is real, but it is to be expected because our planet is rotating very quickly, every single day of the year.\nThe climate is indeed changing, but, according to Diamond and Silk, it\u2019s not due to the pollution we\u2019re spewing into the sky. It\u2019s not carbon or methane. It is, I guess, friction, and the purpose of the Green New Deal is to literally stop the Earth from turning, and reset it. This is a thing that was said on television this very morning, and your Dad watched it, and so did the president of the United States, and all of them echoed Silk\u2019s assessment: \u201cMm-hmm.\u201d\nThe hosts then turned their attention to Beto O\u2019Rourke, who at a recent campaign event called President Trump out for painting Mexicans as \u201crapists\u201d and asylum seekers as \u201canimals.\u201d Asked to respond to O\u2019Rourke, Diamond said this, and I am not kidding: \u201cWhen I listen to Beto O\u2019Rourke, his rhetoric, it reminds me of a slave owner. Anytime you want to tear down existing walls, and allow poor people to flow into our country, and have them living in the shadows, working for slave wages, that makes you a slave owner.\u201d\nIf you were wondering whether they mentioned our president, who continues to benefit from cheap, undocumented labor at his hotels and golf clubs, just kidding, you weren\u2019t wondering that. \u201cBeto O\u2019Rourke need to look at himself,\u201d said Diamond. \u201cAnd his own rhetoric,\u201d Silk chimed in. \u201cRight. Is what he need to do,\u201d Diamond concluded, over a chyron promising even more of this on Fox Nation, for just $5.99 a month.\nFox News\u2014really, cable news in general\u2014is not the place for deep and nuanced thought, but \u201cclimate change is because the Earth is speeding\u201d and \u201cimmigration reform is slavery\u201d are both alarmingly stupid ideas. And once again, there is a one-hundred percent chance the president is watching and nodding along.\nPete Buttigieg, the mayor of South Bend, Indiana, is a Democratic presidential hopeful who was recently attacked by a conservative blog for suggesting Jesus would be cool with bestiality. Buttigieg, who's known as Mayor Pete, did not suggest this. The Washington Post Getty Images\nAll of this comes one day after cable-news pundit and, I don\u2019t know, viking cosplayer Erick Erickson published a post on his blog The Resurgent about rising Democratic Presidential hopeful Mayor Pete Buttigieg. Mayor Pete is an observant Christian who supports a woman\u2019s right to make her own reproductive decisions, and he recently called out the religious right for \u201csaying so much about what Christ said so little about, and so little about what he said so much about.\u201d Pretty much every other Democratic Presidential hopeful holds this view; all are pro-choice, nearly all claim to be Christian. 
So it is interesting that only the first-ever openly gay one gets to be the subject of this headline:\nMayor Pete Buttigieg Apparently Thinks Jesus Would Be Okay With Bestiality.\nThe claim itself is too ignorant to put any energy into rebutting, but it does bear pointing out to my right-wing gay grifters out there: this is what they think of you. Whether you try to fit into their definition of respectable or not, even if you go on Tucker Carlson and eat a Chick-Fil-A sandwich, when push comes to shove, you\u2019re no better than a goat-fucker.\nIs everything getting dumber, or are we just dizzy because the world is spinning so fast?\n","summary":"Diamond and Silk told Fox News viewers that climate change is real\u2014but the result of the earth rotating.","authors":["Dave Holmes"],"publish_date":"04-05-2019","status":"success","url":"https:\/\/www.esquire.com\/news-politics\/a27056109\/fox-and-friends-diamond-and-silk-climate-change\/","domain":"esquire.com","warc_date":"2019-04-19T23:49:39Z","split":"gen","inst_index":51645} 5 | {"title":"Did The High Court Just Pick Sides In the T-Series vs PewDiePie War, Banning Diss Tracks?","text":"If Indian corporations had a rap battle, it would start with \u201cWho are you even?\u201d proceed to the \u201cYou don\u2019t know who I am!\u201d stage and end with a \u201cHindi Mein Bol! Tu Roadie Banega?\u201d Every time they argue online, it eventually boils down to calling a ban and basically, some of us simply cannot take trash talk. If a certain YouTube celebrity went into a trash talk battle, he starts out acting like he doesn\u2019t care and then drop all the clues of caring way too much, enough to make you care too. This is precisely what went down between T-Series and PewDiePie. The weeks-long rivalry for most subscribers on YouTube came to a close call several times during which they had quite a few run-ins.\nJust when T-Series was on a winning streak, earlier this month, overtaking PewDiePie, he took digs at the caste system in a video titled \u2018Congratulations\u2019 (don\u2019t try googling it). Too bad it you haven't already watched it because if you try now, this is what you'll see:\nThe video featured PewDiePie being quite a sore loser. But, we guess T-Series wanted to win that race too. T-Series not only deployed a hoard of Indians (read: I\u2019m not kidding, they made celebrities appeal to the masses) to take PewDiePie down.\nYou can make India win! It\u2019s so exciting to know that @Tseries, is on the brink of becoming the biggest YouTube channel in the world! Good luck @itsBhushanKumar! Let\u2019s all subscribe to ensure #BharatWinsYouTube pic.twitter.com\/um8cmiWcUr \u2014 Varun ZAFAR Dhawan (@Varun_dvn) March 7, 2019\nTaking it to a whole new level appealed to the law after facing some serious humiliation at the hands of the YouTube content creator. The Delhi High Court issued an order to take down the diss tracks \u201cCongratulations\u201d on T-Series\u2019 and \u201cBitch lasagna\u201d. In Congratulations, he blatantly claimed the music label puts out pirated music amongst other defamatory remarks. The High Court, stated abusive, vulgar and racist comments as the reason for the legal move.\nThe verdict, however, isn\u2019t out. There\u2019s going to be another hearing July. 
Till then, the diss tracks will still be unavailable for Indian viewers.\n","summary":"Most \"sore loser\" on YouTube is the new battle!","authors":["Tanzim Pardiwalla"],"publish_date":"04-12-2019","status":"success","url":"https:\/\/in.mashable.com\/culture\/2952\/did-the-high-court-just-pick-sides-in-the-t-series-vs-pewdie","domain":"mashable.com","warc_date":"2019-04-24T16:40:00Z","split":"gen","inst_index":114627} 6 | {"title":"Brexit: Corbyn told to back new EU referendum or lose millions of supporters","text":"Jeremy Corbyn has been warned by Labour\u2019s leader in the European parliament and other grandees that the party will be deserted by millions of anti-Brexit voters if it fails to clearly back a second referendum in its manifesto for next month\u2019s EU elections.\nThe message from Richard Corbett, who leads Labour\u2019s 20 MEPs, comes amid growing fears at the top of the party that it could lose a generation of young, pro-EU voters if it does not guarantee another public vote.\nThat age group, as well as many other Remainers, MPs say, could turn instead to unambiguously anti-Brexit parties, including the fledgling independent group Change UK, the Liberal Democrats, the Greens and the SNP.\nCorbett said: \u201cIf Labour does not re-confirm its support for a confirmatory public vote on any Brexit deal in its manifesto then it will haemorrhage votes to parties who do have a clear message. If on the other hand we do offer clarity and a confirmatory ballot we could do very well.\u201d\nRichard Corbett, leader of Labour\u2019s 20 representatives in the European parliament. Photograph: Imageplotter\/REX\/Shutterstock\nWhile Labour says it is keeping the option of a second referendum on the table in talks with the government, some key figures close to Jeremy Corbyn have been reluctant to confirm that another public vote would be held on any deal that is agreed and approved by parliament. This has prompted speculation that there may be no commitment to one in Labour\u2019s European election manifesto.\nShould Labour support a second Brexit referendum? | Jess Phillips and Gloria de Piero Read more\nFormer Labour foreign secretary Margaret Beckett also called for the manifesto to back a second public vote, saying: \u201cIt is very important that there is a clear message about where Labour stands and what Labour is offering. In my view that clear and simple message should be that there should be a confirmatory vote of the British people.\n\u201cThere is a great opportunity for Labour if we are clear. But lack of clarity would cost us support not only in these elections but it will feed through into the next general election and that may not be far away.\u201d\nThe issue of whether Labour commits to another referendum in its European manifesto, or fudges the issue to avoid alienating its Leave-supporting voters, is already renewing tensions at the top of the party. Those in the shadow cabinet who believe the manifesto should have a second referendum pledge at its heart fear they will be cut out of discussions and that the content and wording will be decided by Corbyn\u2019s office and the national executive committee (NEC), which is dominated by Corbyn supporters. On Saturday night a senior party source said responsibility for what would be in the manifesto would be \u201can NEC decision in consultation with stakeholders\u201d. 
Second referendum supporters in the shadow cabinet \u2013 Keir Starmer, Emily Thornberry and Tom Watson \u2013 are likely to insist, however, that it is also fully involved.\n\u201cIt would be beyond unacceptable if the shadow cabinet is not able to approve the document and it is all done by the NEC and leader\u2019s office,\u201d said another shadow cabinet member. Opponents of another referendum in the shadow cabinet, including party chairman Ian Lavery, warn that Labour will lose support among its Leave voters if it backs a second vote.\nAn Opinium poll for the Observer today finds that just 17% of people who say they are certain to vote in the European elections would choose the Tories, against 29% who would back Labour. Some 26% say they would back pro-Remain parties \u2013 the Liberal Democrats (10%), the SNP (6%) the Greens (6%) and Change UK (4%) \u2013 while 25% would back either Ukip (13%) or Nigel Farage\u2019s new Brexit party (12%). Although it is now almost three years since the June 2016 referendum put the UK on course to leave the EU, European leaders last week insisted Britain would have to take part in European elections at the end of next month as a condition for extending membership until 31 October, unless a Brexit deal passed through parliament before 22 May.\nLabour insiders say all but four of the party\u2019s current MEPs, who all back another referendum, want to stand again and will in all probability be confirmed as candidates this week.\nLabour deputy leader Tom Watson thinks Corbyn should follow the example set by Harold Wilson in 1975. Photograph: Getty Images\nOne senior party figure said: \u201cThe result of this is that even if our manifesto does not confirm a second referendum, that is what our candidates will be advocating on the doorsteps.\u201d\nAt the last European elections in 2014 \u2013 in which Ukip won the most seats \u2013 responsibility for writing Labour\u2019s European election manifesto was delegated to a sub-group of the national policy forum. But this time, given the hugely increased profile of the elections, there are demands for the process to widened.\nWatson said Labour had to tread carefully and suggested the party follow the lead of Harold Wilson, who in 1975 allowed MPs and his cabinet to vote according to their consciences in the referendum confirming UK membership of the European Community.\nHe said: \u201cA Labour government would be duty bound to deliver the result of a confirmatory referendum, whatever that may be. The public must trust us to honour that result so it makes sense for our party leadership to take a careful position and our MPs to be allowed to campaign with their consciences. Wilson\u2019s example is a good one. He kept the government and country together.\u201d\nLabour MP Stephen Kinnock, who has warned that another referendum would damage trust in democracy, said the focus should be to reach a cross-party agreement. He said: \u201cMost Labour colleagues are very encouraged by media reports \u2013 and by the prime minister\u2019s recent comments about a customs union \u2013 that we may be within touching distance of an exit deal that protects jobs, environmental standards and workers\u2019 rights.\n\u201cIf this is indeed the case, then it\u2019s vital that we do not allow the negotiations to be torpedoed by insisting on a public vote. It is just not realistic to hope the prime minister would ever whip her MPs to back a second referendum. 
The first task should be to get a \u2018Labour-shaped\u2019 deal agreed and embedded in the withdrawal agreement so it was not able to be ripped up by future Tory leader.\n\u201cThere will then of course be ample opportunity for colleagues to press their case for a second referendum on the basis of this renegotiated deal by attaching an amendment to the legislation needed to implement Brexit.\u201d\n","summary":"A generation of young people could desert the party, says Richard Corbett, leader of Labour MEPs","authors":["Jon Cruddas","Nick Lowles"],"publish_date":"04-13-2019","status":"success","url":"https:\/\/www.theguardian.com\/politics\/2019\/apr\/13\/corbyn-told-back-eu-referendum-or-lose-millions-voters-brexit","domain":"theguardian.com","warc_date":"2019-04-18T10:30:54Z","split":"gen","inst_index":27053} 7 | {"url":"https:\/\/www.theatlantic.com\/politics\/archive\/2019\/04\/how-key-aides-have-survived-trump-white-house\/587038\/","url_used":"https:\/\/web.archive.org\/web\/20190415131634id_\/https:\/\/www.theatlantic.com\/politics\/archive\/2019\/04\/how-key-aides-have-survived-trump-white-house\/587038\/?utm_source=feed","title":"A Survival Guide for the Trump White House","text":"\u201cIt\u2019s not just higher\u201d under Trump, Tenpas said. \u201cIt\u2019s off the charts.\u201d\nNot all that many high-ranking aides remain from the day Trump gave his inaugural address. Two-thirds of Trump\u2019s most senior aides have left or been promoted, more than every president going back to Ronald Reagan in the 1980s, Tenpas\u2019s research shows. But there are a few survivors. Amid all the tumult, some have even flourished. The keys: praising Trump, mastering skills that he values, and forging alliances in a rivalrous West Wing.\nIf none of that works, plant yourself in front of a TV camera and impress the boss.\nConsider Stephen Miller, the 33-year-old who started during the campaign. Miller\u2019s portfolio is expanding. In recent weeks, Trump has given him a lead role on Miller\u2019s pet issue\u2014immigration policy. Trump told him: \u201cWhatever you need, you\u2019re empowered in this space; get it done,\u201d a White House aide says.\nRead: Trump\u2019s right-hand troll\nMiller has long wanted tougher measures to control migration to the U.S. In that respect, he\u2019s in sync with Trump\u2019s core voters\u2014part of his value to the president. \u201cHe\u2019s a policy adviser who has the pulse of the president\u2019s base,\u201d a former White House official says. \u201cAnd like the president, he\u2019s skeptical of the standard Washington playbook on trade and immigration.\u201d\nMiller is a curious case in that Trump often prizes aides who perform well on cable TV, whose coverage he devours. Over the past two years, Miller has had some epic appearances on the small screen. In a cool medium, he runs hot. But he\u2019s defended Trump to the point that he\u2019s nearly gotten himself thrown off the set. And make no mistake: For a White House aide, the defense, promotion, and celebration of Trump is a durable survival strategy.\nIn an appearance last year on CNN (arguably Trump\u2019s least favorite news outlet), Miller denounced \u201c24 hours of negative anti-Trump hysterical coverage on this network.\u201d As Miller attempted to drown out the host with stories about Trump\u2019s virtues, Jake Tapper observed that he was really addressing an audience of one.\nAides say that Miller is a popular figure inside the building, though, bringing in bagels for the staff. 
At a birthday party for Miller and a couple of other aides on the chief of staff\u2019s patio, Trump made a surprise appearance.\nAn undeniable asset is his speechwriting skills, past and present White House aides say. He has a feel for Trump\u2019s voice\u2014a skill that Trump believes \u201che can\u2019t live without,\u201d one former aide says.\nMiller has also been nimble when it comes to internal White House politics. He was first thought to be an acolyte of the former chief strategist, Steve Bannon. Both touted economic nationalism, putting them at odds with a globalist faction led by the former economic adviser Gary Cohn and the president\u2019s son-in-law, Jared Kushner. But Miller made certain not to alienate the globalists nor to make Bannon\u2019s mistake of antagonizing Trump\u2019s family.\n","summary":"Amid West Wing turnover that is \u201coff the charts,\u201d the survivors have mastered the art of praising the boss.","authors":["Peter Nicholas"],"publish_date":"04-14-2019","domain":"theatlantic.com","warc_date":"20190415131634","status":"success_wayback","split":"gen","inst_index":9201} 8 | {"title":"Georgetown Students Vote to Add Fee for Slavery Reparations to Tuition","text":"The campus of Georgetown University as seen on March 12 in Washington. Win McNamee\/Getty Images\nUndergraduate students at Georgetown University voted overwhelmingly Thursday in favor of paying an additional fee to go toward reparations for the descendants of slaves sold by Georgetown in the 19th century, in what would, if approved by the university, be the first time an American university financially addresses its past as a slave-owning institution.\nThe \u201cReconciliation Contribution\u201d would charge students $27.20 per semester to go toward a fund, directed by a board of students and slave descendants, that would support projects in communities where some descendants of Georgetown\u2019s slaves now live. But a student vote is not a binding measure that sets policy for the larger university, and the administration has not committed to the fund. Instead, the university has said it sees the vote as \u201cvaluable insight into student perspectives,\u201d according to NBC News.\nAccording to the Georgetown University Student Association Elections Commission, which announced the results on Friday, of the nearly 60 percent of undergraduates who voted, 2,541 supported the measure and 1,304 opposed it.\nIn 1838, two Jesuit priests serving as the university\u2019s presidents sold 272 people enslaved by the Maryland Jesuits. The university was mired in debt at the time, and the profits from the sale\u2014$115,000, or about $3 million today, according to the Washington Post\u2014allowed the university to stay open. Proponents of the measure argued that the sale split apart families and sent some slaves South to horrific conditions on cotton and sugar plantations and that the university\u2019s debt to those enslaved people, and their descendants, should be addressed.\nIn 2016, Georgetown publicly recognized and apologized for its slave-holding past, and the university has since renamed the two buildings named after the Jesuit priests who organized the sale. It also gives preferential admissions (considering them as legacy admissions) to descendants of those 272 slaves. 
Four students are currently enrolled under that admissions policy, according to the Post.\nSeveral Democratic presidential candidates have expressed support for some form of reparations nationally\u2014or at least studying the possibility of reparations\u2014including Elizabeth Warren, Kamala Harris, Juli\u00e1n Castro, Beto O\u2019Rourke, and Pete Buttigieg.\n","summary":"They would pay $27 per semester toward a fund for the descendants of slaves sold in 1838.","authors":["Molly Olmstead"],"publish_date":"04-12-2019","status":"success","url":"https:\/\/slate.com\/news-and-politics\/2019\/04\/georgetown-students-reparations-fund-vote.html","domain":"slate.com","warc_date":"2019-04-25T16:22:51Z","split":"gen","inst_index":127836} --------------------------------------------------------------------------------
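Note on the records above: each line of this file is a standalone JSON object (newline-delimited JSON) with fields such as title, text, summary, authors, publish_date, url, domain, warc_date, status, split, and inst_index. The sketch below is illustrative only and is not part of the repository; the path is an assumption (these records appear to come from sample/april2019_set_mini.jsonl, but any of the repo's .jsonl files has the same shape), and not every record is guaranteed to carry every field, so lookups are done defensively.

import json

# Minimal, hedged example of iterating over records like the ones shown above.
# The filename is assumed; swap in whichever .jsonl file you are inspecting.
with open("sample/april2019_set_mini.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Fields observed in the records above; .get() guards against any
        # record that omits one of them.
        print(
            record.get("domain", ""),
            record.get("publish_date", ""),
            record.get("title", "")[:60],
        )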