├── README.md
├── ct-feature-extraction
│   ├── environment.yml
│   ├── extract_ct_features.py
│   ├── extract_ct_features.sh
│   ├── features
│   │   ├── .gitkeep
│   │   ├── ct_features_omentum.csv
│   │   └── ct_features_ovary.csv
│   ├── make_windowed_vols.py
│   ├── params_left_ovary25.yaml
│   ├── params_omentum25.yaml
│   ├── params_right_ovary25.yaml
│   └── process_dataframes.py
├── feature-selection
│   ├── environment.yml
│   ├── results
│   │   ├── .gitkeep
│   │   ├── hr_ct_features_omentum.csv
│   │   ├── hr_ct_features_ovary.csv
│   │   └── hr_hne_features.csv
│   ├── select_features.py
│   └── select_features.sh
├── global_config.yaml
├── hne-feature-extraction
│   ├── 0_get_cohort_csv.sh
│   ├── 1_infer_tissue_types_and_extract_features.sh
│   ├── 1_infer_tissue_types_and_extract_features.sub
│   ├── 2_extract_objects.sh
│   ├── 2_extract_objects.sub
│   ├── 3_label_objects_and_extract_features.sh
│   ├── 3_label_objects_and_extract_features.sub
│   ├── _run_gpu_qupath_stardist_singleSlide.sh
│   ├── bitmaps
│   │   └── .gitkeep
│   ├── connector.py
│   ├── environment.yml
│   ├── extract_feats_from_bitmaps.py
│   ├── extract_feats_from_object_detections.py
│   ├── final_objects
│   │   └── .gitkeep
│   ├── hne-feature-extraction.dag
│   ├── infer_tissue_tile_clf.py
│   ├── inference
│   │   └── .gitkeep
│   ├── map_inference_to_bitmap.py
│   ├── merge_cells_and_regions.py
│   ├── qupath
│   │   ├── Makefile
│   │   ├── README.md
│   │   ├── data
│   │   │   ├── results
│   │   │   │   └── .gitkeep
│   │   │   └── slides
│   │   │       └── .gitkeep
│   │   ├── detections
│   │   │   └── CMU-1-Small-Region_2_stardist_detections_and_measurements.tsv
│   │   ├── docker-compose.yml
│   │   ├── dockerfile
│   │   ├── init_singularity_env.sh
│   │   ├── models
│   │   │   ├── ANN_StardistSeg3.0CellExp1.0CellConstraint_AllFeatures_LymphClassifier.json
│   │   │   └── he_heavy_augment
│   │   │       ├── saved_model.pb
│   │   │       └── variables
│   │   │           ├── variables.data-00000-of-00001
│   │   │           └── variables.index
│   │   └── scripts
│   │       └── stardist_nuclei_and_lymphocytes.groovy
│   ├── tissue_tile_features
│   │   └── reference_hne_features.csv
│   └── visualizations
│       └── .gitkeep
├── license.md
├── survival-modeling
│   ├── environment.yml
│   ├── figures
│   │   ├── .gitkeep
│   │   ├── barplots
│   │   │   └── .gitkeep
│   │   ├── crs_plots
│   │   │   └── .gitkeep
│   │   ├── feature_plots
│   │   │   └── .gitkeep
│   │   ├── forest_plots
│   │   │   └── .gitkeep
│   │   ├── km_plots
│   │   │   └── .gitkeep
│   │   └── multimodal
│   │       └── .gitkeep
│   ├── results
│   │   ├── .gitkeep
│   │   ├── crs
│   │   │   └── .gitkeep
│   │   └── model_summaries
│   │       └── .gitkeep
│   ├── train_test.py
│   └── utils.py
└── tissue-type-training
    ├── checkpoints
    │   ├── .gitkeep
    │   └── tissue_type_classifier_weights.torch
    ├── config.py
    ├── confusion_matrix_analysis.py
    ├── cross_validate_on_annotations.sh
    ├── dataset.py
    ├── environment.yml
    ├── eval_tissue_tile.py
    ├── evals
    │   └── .gitkeep
    ├── general_utils.py
    ├── models.py
    ├── pred_tissue_tile.py
    ├── predictions
    │   └── .gitkeep
    ├── preprocess.py
    ├── pretile.py
    ├── pretilings
    │   └── .gitkeep
    ├── train_on_all_annotations.sh
    ├── train_tissue_tile_clf.py
    └── visualizations
        └── .gitkeep
-------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
1 | # OncoFusion
2 | This software extracts features from histopathologic whole-slide images, contrast-enhanced computed tomography, targeted sequencing panels, and clinical covariates and subsequently integrates them using a late-fusion machine learning model to stratify patients by overall survival. Repository to accompany [Multimodal data integration using machine learning improves risk stratification of high-grade serous ovarian cancer](https://www.nature.com/articles/s43018-022-00388-9).
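For orientation, here is a minimal sketch of the late-fusion idea (illustrative only: the modality keys and the simple averaging rule below are assumptions for this sketch; the actual combination is implemented in `survival-modeling/train_test.py`):

```python
import numpy as np

# Hypothetical per-patient risk scores from independently trained unimodal models
unimodal_risks = {
    "hne": np.array([0.8, 0.2, 0.5]),
    "ct": np.array([0.6, 0.3, 0.7]),
    "genomic": np.array([0.7, 0.1, 0.4]),
    "clinical": np.array([0.9, 0.2, 0.6]),
}

# Late fusion combines predictions rather than raw features; the mean is one simple choice
fused_risk = np.mean(list(unimodal_risks.values()), axis=0)
```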
3 | 
4 | ## Requirements
5 | 
6 | Hardware: Tested on a server with 96 CPUs, 500 GB CPU RAM, 4 GPUs (Tesla V100, CUDA Version: 11.4), 64 GB GPU RAM, and 1 TB storage.
7 | 
8 | Software: Tested on Red Hat Enterprise Linux v7.8 with Python v3.9, Conda v4.12, Singularity v3.8.3, and the conda environments specified in the environment.yml files in each sub-directory.
9 | 
10 | ## Set up
11 | 
12 | ### Download Synapse repository
13 | https://www.synapse.org/#!Synapse:syn25946117/wiki/611576
14 | 
15 | ### Download H&E WSIs
16 | Download the H&E WSIs listed within the downloaded Synapse repository at `data/hne/tcga/manifest.txt` using the GDC Data Transfer Tool (https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/). Ensure a flat file structure (.svs files directly within the `data/hne/tcga` folder).
17 | 
18 | ### Clone this GitHub repository
19 | It is recommended to clone the GitHub repository into the same directory as the Synapse repository. Conda environments are provided as `environment.yml` files for each stage of the pipeline.
20 | 
21 | ### Set global parameters
22 | In `global_config.yaml`, set the full paths to the directories enclosing the data and code. All scripts assume that the code and data are within subdirectories of these paths, named `code` and `data` respectively.
23 | 
24 | ### Move Singularity image to code repository
25 | Move `qupath-stardist_latest.sif` from `data` to `code/hne-feature-extraction/qupath`.
26 | 
27 | ## Tissue type training
28 | Using annotations by gynecologic pathologists (found within the `tissue-type-training` directory of the Synapse repository), train a semantic segmentation model to infer tissue type from H&E images. This component is optional: the resulting weights of our training are already stored in `tissue-type-training/checkpoints/tissue_type_classifier_weights.torch`. Other than the paths set in the global YAML file in the previous step, all options are set in `config.py`. For help, use `python config.py --h`.
29 | 
30 | ### Cross-validate model for tissue type inference
31 | `tissue-type-training/cross_validate_on_annotations.sh`
32 | Use this to explore various model types and hyperparameter configurations.
33 | 
34 | ### Train model for tissue type inference
35 | `tissue-type-training/train_on_all_annotations.sh` Note that `preprocess.py` and `pretile.py` must be run before this step; both are invoked by the cross-validation script, so running that first suffices.
36 | 
37 | ## H&E feature extraction
38 | 
39 | ### Extract tissue type features
40 | Next, we apply our trained model to semantically segment tissue types on slides from our multimodal patient cohort: `hne-feature-extraction/1_infer_tissue_types_and_extract_features.sh`. This is the process that ultimately generates the tissue type-based features in `hne-feature-extraction/tissue_tile_features/reference_hne_features.csv`.
41 | 
42 | ### Identify nuclei
43 | Using the StarDist extension for QuPath, we perform instance segmentation of cellular nuclei and apply a bespoke classification script to distinguish lymphocytes from other nuclei: `hne-feature-extraction/2_extract_objects.sh`. Before running this script, move or copy slides of interest from `data/hne` to `code/hne-feature-extraction/qupath/data/slides`.
44 | 
45 | ### Label nuclei by tissue type; extract nuclear features
46 | Finally, we coregister the two feature spaces and extract descriptive statistics for nuclei of each cell type: `hne-feature-extraction/3_label_objects_and_extract_features.sh`. This is the process that ultimately generates the nuclear features in `reference_hne_features.csv`.
47 | 
48 | 
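For intuition, one family of the resulting nuclear features is a density: a count of classified nuclei divided by the area of the enclosing tissue region (cf. `get_density` in `hne-feature-extraction/extract_feats_from_object_detections.py`). A toy sketch with hypothetical numbers:

```python
n_lymphocytes_in_tumor = 12500  # nuclei with Class == 'Lymphocyte' and Parent == 'Tumor'
tumor_area_um2 = 4.0e7          # total tumor region area in square microns
tumor_lymphocyte_density = n_lymphocytes_in_tumor / tumor_area_um2  # -> Tumor_Lymphocyte_density
```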
49 | ## CT feature extraction
50 | We apply the abdominal window and extract features from omental and adnexal lesions contoured by fellowship-trained diagnostic radiologists: `ct-feature-extraction/extract_ct_features.sh`. Features are stored as a CSV file in the `features` subdirectory.
51 | 
52 | 
53 | ## Feature selection
54 | We use log partial hazard ratios and their associated significance, calculated by univariate Cox regression, to select informative features from the CT and H&E feature spaces: `feature-selection/select_features.sh`. The log partial hazard ratio for each feature across the training cohort and the associated volcano plot are generated for each modality in the `results` subdirectory.
55 | 
56 | 
57 | ## Survival modeling
58 | Use `feature-selection/environment.yml`.
59 | We build univariate survival models for the histopathologic, radiologic, clinical, and genomic information spaces. Subsequently, we combine the modalities in a late-fusion framework and plot the performance: `survival-modeling/train_test.py`. Relevant results and figures are generated in the respective subdirectories.
60 | 
-------------------------------------------------------------------------------- /ct-feature-extraction/environment.yml: --------------------------------------------------------------------------------
1 | name: pyrad
2 | channels:
3 | - conda-forge
4 | - anaconda
5 | - defaults
6 | dependencies:
7 | - _libgcc_mutex=0.1=main
8 | - _openmp_mutex=4.5=1_gnu
9 | - blas=1.0=mkl
10 | - ca-certificates=2022.3.29=h06a4308_1
11 | - certifi=2021.5.30=py36h06a4308_0
12 | - cudatoolkit=10.1.243=h6bb024c_0
13 | - cycler=0.11.0=pyhd3eb1b0_0
14 | - dbus=1.13.18=hb2f20db_0
15 | - expat=2.4.4=h295c915_0
16 | - fontconfig=2.13.1=h6c09931_0
17 | - freetype=2.11.0=h70c0345_0
18 | - future=0.18.2=py36_1
19 | - glib=2.63.1=h5a9c865_0
20 | - gst-plugins-base=1.14.0=hbbd80ab_1
21 | - gstreamer=1.14.0=hb453b48_1
22 | - icu=58.2=he6710b0_3
23 | - intel-openmp=2022.0.1=h06a4308_3633
24 | - joblib=1.0.1=pyhd3eb1b0_0
25 | - jpeg=9d=h7f8727e_0
26 | - kiwisolver=1.3.1=py36h2531618_0
27 | - ld_impl_linux-64=2.35.1=h7274673_9
28 | - libffi=3.2.1=hf484d3e_1007
29 | - libgcc-ng=9.3.0=h5101ec6_17
30 | - libgfortran-ng=7.5.0=ha8ba4b0_17
31 | - libgfortran4=7.5.0=ha8ba4b0_17
32 | - libgomp=9.3.0=h5101ec6_17
33 | - libpng=1.6.37=hbc83047_0
34 | - libstdcxx-ng=9.3.0=hd4cf53a_17
35 | - libuuid=1.0.3=h7f8727e_2
36 | - libxcb=1.14=h7b6447c_0
37 | - libxml2=2.9.12=h03d6c58_0
38 | - matplotlib=3.1.3=py36_0
39 | - matplotlib-base=3.1.3=py36hef1b27d_0
40 | - mkl=2020.2=256
41 | - mkl-service=2.3.0=py36he8ac12f_0
42 | - mkl_fft=1.3.0=py36h54f3939_0
43 | - mkl_random=1.1.1=py36h0573a6f_0
44 | - ncurses=6.3=h7f8727e_2
45 | - numpy-base=1.18.5=py36hde5b4d6_0
46 | - numpy-indexed=0.3.5=py_1
47 | - openssl=1.1.1n=h7f8727e_0
48 | - pandas=1.0.3=py36h0573a6f_0
49 | - pcre=8.45=h295c915_0
50 | - pip=21.2.2=py36h06a4308_0
51 | - pydicom=2.1.2=pyhd3deb0d_0
52 | - pyparsing=3.0.4=pyhd3eb1b0_0
53 | - pyqt=5.9.2=py36h05f1152_2
54 | - python=3.6.10=hcf32534_1
55 | - python-dateutil=2.8.2=pyhd3eb1b0_0
56 | - pytz=2021.3=pyhd3eb1b0_0
57 | - qt=5.9.7=h5867ecd_1
58 | - readline=8.1.2=h7f8727e_1
59 | - scikit-learn=0.22.1=py36hd81dba3_0
60 | - scipy=1.5.2=py36h0b6359f_0
61 | - setuptools=58.0.4=py36h06a4308_0
62 | - sip=4.19.8=py36hf484d3e_0
63 | - six=1.16.0=pyhd3eb1b0_1
64 | - sqlite=3.38.2=hc218d9a_0
65 | - tk=8.6.11=h1ccaba5_0
66 | - 
tornado=6.1=py36h27cfd23_0 67 | - wheel=0.37.1=pyhd3eb1b0_0 68 | - xlrd=1.2.0=py36_0 69 | - xz=5.2.5=h7b6447c_0 70 | - zlib=1.2.12=h7f8727e_2 71 | - pip: 72 | - docopt==0.6.2 73 | - medpy==0.4.0 74 | - numpy==1.18.2 75 | - pykwalify==1.7.0 76 | - pyradiomics==3.0.1 77 | - pywavelets==1.0.0 78 | - pyyaml==5.3.1 79 | - simpleitk==1.2.4 80 | -------------------------------------------------------------------------------- /ct-feature-extraction/extract_ct_features.py: -------------------------------------------------------------------------------- 1 | """ 2 | extract_ct_features.py extracts PyRadiomics-defined features from windowed CT volumes. 3 | """ 4 | 5 | import pandas as pd 6 | import numpy as np 7 | import os 8 | import logging 9 | import yaml 10 | 11 | from joblib import Parallel, delayed 12 | from radiomics import featureextractor 13 | 14 | 15 | def define_all_lesions_for_one_site(df, params_fn, results_fn): 16 | print('-'*32) 17 | print(params_fn) 18 | 19 | all_results = Parallel(n_jobs=16)(delayed(extract_single)(params_fn, row) for _, row in df.iterrows()) 20 | #all_results = [] 21 | #for _, row in df.iterrows(): 22 | # all_results.append(extract_single(params_fn, row)) 23 | results = pd.DataFrame(all_results) 24 | results['Patient ID'] = results['Patient ID'].astype(str).apply(lambda x: x.zfill(3)) 25 | results.to_csv(results_fn, index=False) 26 | 27 | 28 | def extract_single(params_fn, row): 29 | extractor = featureextractor.RadiomicsFeatureExtractor(params_fn) 30 | logger = logging.getLogger("radiomics") 31 | logger.setLevel(logging.ERROR) 32 | try: 33 | result = extractor.execute(os.path.join(DATA_DIR, row.windowed_image_path), 34 | os.path.join(DATA_DIR, row.segmentation_path)) 35 | print('Extracted features successfully for {}'.format(row.windowed_image_path)) 36 | except ValueError: 37 | result = {} 38 | print('WARNING: ValueError for {}'.format(row.windowed_image_path)) 39 | except RuntimeError: 40 | result = {} 41 | print('WARNING: RuntimeError for {}'.format(row.windowed_image_path)) 42 | 43 | result.update({'Patient ID': str(row['Patient ID'])}) 44 | return result 45 | 46 | 47 | if __name__ == '__main__': 48 | with open('../global_config.yaml', 'r') as f: 49 | CONFIGS = yaml.safe_load(f) 50 | DATA_DIR = CONFIGS['data_dir'] 51 | 52 | INPUT_DATAFRAME_PATH = os.path.join(DATA_DIR, 'data/dataframes/ct_df.csv') 53 | DF = pd.read_csv(INPUT_DATAFRAME_PATH) 54 | DF['Patient ID'] = DF['Patient ID'].astype(str) 55 | BINS = [25] 56 | 57 | for B in BINS: 58 | print('Extracting features for bin size {}'.format(B)) 59 | 60 | define_all_lesions_for_one_site(DF, 61 | 'params_left_ovary{}.yaml'.format(B), 62 | 'features/_ct_left_ovary_bin{}.csv'.format(B)) 63 | 64 | define_all_lesions_for_one_site(DF, 65 | 'params_right_ovary{}.yaml'.format(B), 66 | 'features/_ct_right_ovary_bin{}.csv'.format(B)) 67 | 68 | define_all_lesions_for_one_site(DF, 69 | 'params_omentum{}.yaml'.format(B), 70 | 'features/_ct_omentum_bin{}.csv'.format(B)) 71 | 72 | -------------------------------------------------------------------------------- /ct-feature-extraction/extract_ct_features.sh: -------------------------------------------------------------------------------- 1 | python make_windowed_vols.py 2 | python extract_ct_features.py 3 | python process_dataframes.py 4 | -------------------------------------------------------------------------------- /ct-feature-extraction/features/.gitkeep: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/ct-feature-extraction/features/.gitkeep
-------------------------------------------------------------------------------- /ct-feature-extraction/make_windowed_vols.py: --------------------------------------------------------------------------------
1 | """
2 | make_windowed_vols.py applies the abdominal window to the raw MHD files listed in data/dataframes/ct_df.csv
3 | and saves the resulting MHD files in data/ct/windowed_scans. TCGA MHD files must be downloaded before
4 | running.
5 | """
6 | import pandas as pd
7 | import numpy as np
8 | import os
9 | import yaml
10 | 
11 | from medpy.io import load, save
12 | from joblib import delayed, Parallel
13 | 
14 | 
15 | with open('../global_config.yaml', 'r') as f:
16 |     DATA_DIR = yaml.safe_load(f)['data_dir']
17 | 
18 | LEVEL = 50
19 | WIDTH = 400
20 | LOWER_BOUND = LEVEL - WIDTH//2
21 | UPPER_BOUND = LEVEL + WIDTH//2
22 | INPUT_DATAFRAME_PATH = os.path.join(DATA_DIR, 'data/dataframes/ct_df.csv')
23 | 
24 | 
25 | def make_windowed(row):
26 |     """
27 |     Given a row of the CT data frame, load the MHD file, apply the window, and save the windowed version.
28 |     """
29 |     input_tumor_img_fn = os.path.join(DATA_DIR, row['image_path'])
30 |     output_tumor_img_fn = os.path.join(DATA_DIR, row['windowed_image_path'])
31 |     try:
32 |         tumor_img, header = load(input_tumor_img_fn)
33 |         tumor_img = np.clip(tumor_img, a_min=LOWER_BOUND, a_max=UPPER_BOUND)
34 |         sub_dir = '/'.join(output_tumor_img_fn.split('/')[:-1])
35 |         if not os.path.exists(sub_dir):
36 |             os.mkdir(sub_dir)
37 |         save(tumor_img, output_tumor_img_fn, header)
38 |         print('{} succeeded'.format(input_tumor_img_fn))
39 |     except Exception:
40 |         print('{} failed'.format(input_tumor_img_fn))
41 | 
42 | 
43 | if __name__ == '__main__':
44 |     df = pd.read_csv(INPUT_DATAFRAME_PATH)
45 |     if not os.path.exists(os.path.join(DATA_DIR, 'data/ct/windowed_scans')):
46 |         os.mkdir(os.path.join(DATA_DIR, 'data/ct/windowed_scans'))
47 |     Parallel(n_jobs=16)(delayed(make_windowed)(row) for idx, row in df.iterrows())
48 |     #for _, row in df.iterrows():
49 |     #    make_windowed(row)
50 | 
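# --- Added note (worked numbers, for clarity) ---------------------------------
# With LEVEL = 50 and WIDTH = 400, the abdominal window clips every voxel to
#   [LEVEL - WIDTH//2, LEVEL + WIDTH//2] = [-150, 250] HU,
# which is exactly what np.clip applies in make_windowed above.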
-------------------------------------------------------------------------------- /ct-feature-extraction/params_left_ovary25.yaml: --------------------------------------------------------------------------------
1 | setting:
2 |   label: 46
3 |   resampledPixelSpacing: [1.0, 1.0, 1.0]
4 |   binWidth: 25
5 |   interpolator: 'sitkBSpline'
6 | 
7 | imageType:
8 |   Wavelet: {}
9 | 
10 | 
-------------------------------------------------------------------------------- /ct-feature-extraction/params_omentum25.yaml: --------------------------------------------------------------------------------
1 | setting:
2 |   label: 43
3 |   resampledPixelSpacing: [1.0, 1.0, 1.0]
4 |   binWidth: 25
5 |   interpolator: 'sitkBSpline'
6 | 
7 | imageType:
8 |   Wavelet: {}
9 | 
10 | 
-------------------------------------------------------------------------------- /ct-feature-extraction/params_right_ovary25.yaml: --------------------------------------------------------------------------------
1 | setting:
2 |   label: 45
3 |   resampledPixelSpacing: [1.0, 1.0, 1.0]
4 |   binWidth: 25
5 |   interpolator: 'sitkBSpline'
6 | 
7 | imageType:
8 |   Wavelet: {}
9 | 
10 | 
-------------------------------------------------------------------------------- /ct-feature-extraction/process_dataframes.py: --------------------------------------------------------------------------------
1 | import pandas as pd
2 | 
3 | def make_df(input_filenames):
4 |     # load & concatenate if necessary
5 |     if isinstance(input_filenames, str):
6 |         input_filenames = [input_filenames]
7 | 
8 |     dfs = [pd.read_csv(input_filename, engine='python') for input_filename in input_filenames]
9 |     df = pd.concat(dfs)
10 | 
11 |     # drop NaNs and duplicates (for ovarian)
12 |     df = df.dropna(how='any')
13 |     df = df.drop_duplicates(subset=['Patient ID'])
14 | 
15 |     # remove extraneous columns, keeping only wavelet features
16 |     df = df.set_index('Patient ID').filter(regex='^wavelet', axis=1)
17 |     df = df.reset_index()
18 | 
19 |     return df
20 | 
21 | 
22 | if __name__ == '__main__':
23 |     omentum_df = make_df('features/_ct_omentum_bin25.csv')
24 |     omentum_df.to_csv('features/ct_features_omentum.csv', index=False)
25 | 
26 |     ovary_df = make_df(['features/_ct_left_ovary_bin25.csv',
27 |                         'features/_ct_right_ovary_bin25.csv'])
28 |     ovary_df.to_csv('features/ct_features_ovary.csv', index=False)
29 | 
-------------------------------------------------------------------------------- /feature-selection/environment.yml: --------------------------------------------------------------------------------
1 | name: sklearn
2 | channels:
3 | - conda-forge
4 | - bioconda
5 | - defaults
6 | dependencies:
7 | - scikit-learn
8 | - pandas
9 | - lifelines
10 | - seaborn
11 | - xlrd
12 | - openpyxl
13 | - rasterio
14 | - shapely
15 | - albumentations
16 | - requests
17 | - bokeh
18 | - libtiff
19 | - colorcet
20 | - holoviews
21 | - pingouin
22 | - pip
23 | - pip:
24 |   - statannot
25 | prefix: /Users/boehmk/anaconda3/envs/sklearn
26 | 
-------------------------------------------------------------------------------- /feature-selection/results/.gitkeep: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/feature-selection/results/.gitkeep
-------------------------------------------------------------------------------- /feature-selection/select_features.py: --------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import seaborn as sns
4 | import matplotlib.pyplot as plt
5 | import lifelines
6 | import yaml
7 | import os
8 | import warnings
9 | 
10 | from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
11 | from scipy.stats import spearmanr, pearsonr, kendalltau, zscore, percentileofscore, chisquare, power_divergence
12 | from statsmodels.stats.outliers_influence import variance_inflation_factor
13 | from pingouin import partial_corr
14 | from lifelines import CoxPHFitter
15 | from lifelines.utils import concordance_index
16 | from joblib import Parallel, delayed
17 | from argparse import ArgumentParser
18 | from sklearn.cluster import KMeans
19 | from statsmodels.stats.multitest import multipletests
20 | 
21 | warnings.simplefilter("ignore")
22 | 
23 | FONTSIZE = 7
24 | plt.rc('legend', fontsize=FONTSIZE, title_fontsize=FONTSIZE)
25 | plt.rc('xtick', labelsize=FONTSIZE)
26 | plt.rc('ytick', labelsize=FONTSIZE)
27 | plt.rc("axes", labelsize=FONTSIZE)
28 | 
29 | def evaluate_feature_partial_correlation(df, feat_col, y_col, covariate_col, x_covariate_col, y_covariate_col, method):
30 |     """
31 |     :param df: pandas DataFrame with each row being a patient and each column being a feature or outcome, containing column "duration" (float) of survival time
32 |     :param feat_col: column name (str) of feature values (float)
33 |     :param y_col: column name (str) of outcomes (float)
34 |     :param covariate_col: column name (str) of XY covariate (float)
35 |     :param x_covariate_col: column name (str) of X covariate (float)
36 |     :param y_covariate_col: column name (str) of Y covariate (float)
37 |     :param method: name (str) of a method supported by pingouin.partial_corr
38 |     :return: single-entry dict with {feat_col (str): [p (float), corr (float)]}
39 |     """
40 |     results = partial_corr(data=df,
41 |                            x=feat_col,
42 |                            y=y_col,
43 |                            y_covar=y_covariate_col,
44 |                            x_covar=x_covariate_col,
45 |                            covar=covariate_col,
46 |                            method=method)
47 |     corr = results['r'].item()
48 |     p = results['p-val'].item()
49 |     if np.isnan(p):
50 |         p = 1.0
51 |         corr = 0.0
52 |     return {feat_col: [p, corr]}
53 | 
54 | 
55 | def evaluate_feature_cox(df, feat_col, covar):
56 |     """
57 |     :param df: pandas DataFrame with each row being a patient and each column being a feature or outcome, containing columns "duration" (float) of survival time and "observed" (float) to delineate observed [1.0] or censored [0.0] outcomes
58 |     :param feat_col: column name (str) of feature values (float)
59 |     :param covar: column name (str) of XY covariate (float), or None
60 |     :return: single-entry dict with {feat_col (str): [p (float), log(partial_hazard_ratio) (float)]}
61 |     """
62 | 
63 |     col_list = ['duration', 'observed', feat_col]
64 |     if covar:
65 |         col_list.append(covar)
66 | 
67 |     model = CoxPHFitter(penalizer=0.0)
68 |     try:
69 |         model.fit(df[col_list], duration_col='duration', event_col='observed')
70 |     except (lifelines.exceptions.ConvergenceError, lifelines.exceptions.ConvergenceWarning):
71 |         try:
72 |             model = CoxPHFitter(penalizer=0.2)
73 |             model.fit(df[col_list], duration_col='duration', event_col='observed')
74 |         except (lifelines.exceptions.ConvergenceError, lifelines.exceptions.ConvergenceWarning):
75 |             return {feat_col: [1.0, 0.0]}
76 | 
77 |     coef = model.summary.coef[feat_col]
78 |     p = model.summary.p[feat_col]
79 | 
80 |     return {feat_col: [p, coef]}
81 | 
82 | 
83 | def evaluate_feature_concordance(df, feat_col, k_permutations=100):
84 |     """
85 |     :param df: pandas DataFrame with each row being a patient and each column being a feature or outcome, containing columns "duration" (float) of survival time and "observed" (float) to delineate observed [1.0] or censored [0.0] outcomes
86 |     :param feat_col: column name (str) of feature values (float)
87 |     :param k_permutations: number of iterations (int) for the permutation test used to assess statistical significance
88 |     :return: single-entry dict with {feat_col (str): [p (float), c_index - 0.5 (float)]}
89 |     """
90 | 
91 |     c = lifelines.utils.concordance_index(event_times=df['duration'],
92 |                                           predicted_scores=df[feat_col],
93 |                                           event_observed=df['observed'])
94 |     directional_deviance = c - 0.5
95 |     absolute_deviance = np.abs(directional_deviance)
96 | 
97 |     random_absolute_deviances = []
98 |     for _ in range(k_permutations):
99 |         random_c = lifelines.utils.concordance_index(event_times=df['duration'],
100 |                                                      predicted_scores=df[feat_col].sample(frac=1),
101 |                                                      event_observed=df['observed'])
102 |         random_absolute_deviances.append(np.abs(random_c - 0.5))
103 |     p = (np.array(random_absolute_deviances) >= absolute_deviance).mean()
104 |     return {feat_col: [p, directional_deviance]}
105 | 
106 | 
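# --- Illustrative usage (added note; toy data, not part of the pipeline) ------
# Each evaluator returns a single-entry dict of {feature: [p_value, statistic]}:
#
#   toy = pd.DataFrame({'duration': [5., 8., 12., 20., 7., 15.],
#                       'observed': [1., 1., 0., 1., 1., 0.],
#                       'feat_a':   [0.9, 0.7, 0.3, 0.1, 0.8, 0.2]})
#   evaluate_feature_cox(toy, 'feat_a', covar=None)
#   # -> {'feat_a': [p_value, log_partial_hazard_ratio]}
#   evaluate_feature_concordance(toy, 'feat_a', k_permutations=100)
#   # -> {'feat_a': [permutation_p_value, c_index - 0.5]}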
107 | def evaluate_features(df, feats_to_consider, method='kendall', covar=None, x_covar=None, y_covar=None, n_jobs=-1):
108 |     """
109 |     :param df: pandas DataFrame with each row being a patient and each column being a feature or outcome
110 |     :param feats_to_consider: list of column names (str) identifying feature columns
111 |     :param method: name (str) of a method supported by pingouin.partial_corr, "cph" for Cox regression, or "c-index" for concordance assessment
112 |     :param covar: column name (str) of XY covariate (float)
113 |     :param x_covar: column name (str) of X covariate (float)
114 |     :param y_covar: column name (str) of Y covariate (float)
115 |     :param n_jobs: number of parallel jobs to run for feature assessment
116 |     :return: pandas DataFrame with columns ['feat', 'p', 'stat']
117 |     """
118 |     assert not df.isna().sum().any()
119 |     assert ('duration' in df.columns) and ('observed' in df.columns)
120 | 
121 |     if method == 'cph':
122 |         assert (not x_covar) and (not y_covar)
123 |         dicts = Parallel(n_jobs=n_jobs)(delayed(evaluate_feature_cox)
124 |                                         (df, effect, covar) for effect in feats_to_consider)
125 |     elif method == 'c-index':
126 |         assert (not covar) and (not x_covar) and (not y_covar)
127 |         dicts = Parallel(n_jobs=n_jobs)(delayed(evaluate_feature_concordance)
128 |                                         (df, effect) for effect in feats_to_consider)
129 |     elif method in ['pearson', 'spearman', 'kendall']:
130 |         assert (not (covar and x_covar)) and (not (covar and y_covar))
131 |         dicts = Parallel(n_jobs=n_jobs)(delayed(evaluate_feature_partial_correlation)
132 |                                         (df[df.observed.astype(bool)], effect, 'duration', covar, x_covar, y_covar, method) for effect in feats_to_consider)
133 |     else:
134 |         raise RuntimeError("Unknown method {}".format(method))
135 | 
136 |     results = {}
137 |     for dict_ in dicts:
138 |         results.update(dict_)
139 |     results = list(results.items())
140 | 
141 |     feats = np.array([x[0] for x in results])
142 |     p_values = np.array([x[1][0] for x in results])
143 |     stat_values = np.array([x[1][1] for x in results])
144 |     results_df = pd.DataFrame({'feat': feats, 'p': p_values, 'stat': stat_values})
145 | 
146 |     if method == 'response-agnostic':
147 |         results_df['p'] = results_df.stat.apply(lambda x: (100 - percentileofscore(results_df.stat, x))/100)
148 |     results_df = results_df.sort_values(by='p')
149 | 
150 |     return results_df
151 | 
152 | 
153 | def _get_features_to_consider(all_columns, args):
154 |     """
155 |     :param all_columns: list of columns in input dataframe
156 |     :param args: parsed arguments
157 |     :return: list of actual feature-containing columns (excluding covariates and outcomes)
158 |     """
159 |     columns_to_exclude = ['duration', 'observed', 'Unnamed: 0']
160 |     for col in [args.xy_covar, args.x_covar, args.y_covar, args.index_col]:
161 |         if col:
162 |             columns_to_exclude.append(col)
163 |     return list(set(all_columns) - set(columns_to_exclude))
164 | 
165 | 
166 | def preprocess_features(df, outlier_threshold, feature_names=None, scaler=None):
167 |     """
168 |     :param df: input dataframe with features
169 |     :param outlier_threshold: number of std devs beyond which a value counts as an outlier (-1 disables outlier handling)
170 |     :param feature_names: list of feature names (str); if given, only these columns are transformed
171 |     :return: (dataframe with processed features (outliers imputed with the column median, then min-max scaled; a pre-fit ``scaler`` is applied if provided), the fitted scaler)
172 |     """
173 |     if feature_names:
174 |         x = df[feature_names].copy(deep=True)
175 |         y = df.drop(columns=feature_names).copy(deep=True)
176 |     else:
177 |         x = df
178 | 
179 |     # x[(zscore(x.values.astype(float))) < -outlier_threshold] = -outlier_threshold
180 |     # x[(zscore(x.values.astype(float))) > outlier_threshold] = outlier_threshold
181 |     if outlier_threshold != -1:
182 |         x[(np.abs(zscore(x.values.astype(float))) > outlier_threshold)] = np.nan
183 |         x = x.fillna(x.median())
184 |     if not scaler:
185 |         scaler = MinMaxScaler()
186 |         # scaler = StandardScaler()
187 |         # scaler = RobustScaler()
188 |         x = pd.DataFrame(scaler.fit_transform(x.values), columns=x.columns, index=x.index)
189 |     else:
190 |         x = pd.DataFrame(scaler.transform(x.values), columns=x.columns, index=x.index)
191 |         # scaler = RobustScaler()
192 | 
193 |     if outlier_threshold != -1:
194 |         x[x > outlier_threshold] = outlier_threshold
195 |         x[x < -outlier_threshold] = -outlier_threshold
196 |     else:
197 |         x[x > 5] = 5
198 |         x[x < -5] = -5
199 | 
200 |     if feature_names:
201 |         df = x.join(y)
202 |     else:
203 |         df = x
204 |     return df, scaler
205 | 
206 | 
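# --- Behavior sketch (added note; illustrative numbers) -----------------------
# preprocess_features flags cells whose per-feature z-score exceeds
# outlier_threshold standard deviations, imputes them with the column median,
# then scales each column to [0, 1] with MinMaxScaler, e.g.:
#   raw:    [0.9, 1.0, 1.1]
#   scaled: [0.0, 0.5, 1.0]
# Passing a previously fitted `scaler` applies transform() instead of
# fit_transform(), so held-out data can reuse the training-set scaling.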
207 | def _get_x_axis_name_volcano(method):
208 |     if method == 'kendall':
209 |         x_axis_name = "Kendall's $\\tau$"
210 |     elif method == 'spearman':
211 |         x_axis_name = "Spearman's $\\rho$"
212 |     elif method == 'pearson':
213 |         x_axis_name = "Pearson's $\\rho$"
214 |     elif method == 'response-agnostic':
215 |         x_axis_name = "IQR"
216 |     elif method == 'cph':
217 |         x_axis_name = "log(Hazard ratio)"
218 |     elif method == 'c-index':
219 |         x_axis_name = "concordance (deviation from random)"
220 |     else:
221 |         raise RuntimeError("Cannot generate volcano plot x-axis name for method {}".format(method))
222 |     return x_axis_name
223 | 
224 | 
225 | def _make_results_df_pretty_for_plotting(df_, x_axis_name, modality, eps=1e-30):
226 |     df_.loc[df_.p < eps, 'p'] = eps
227 |     df = pd.DataFrame({'feature': df_.feat,
228 |                        '-log(p)': -np.log10(df_.p),
229 |                        x_axis_name: df_.stat})
230 |     if modality == 'radiology':
231 |         try:
232 |             for feat_name in df.feature:
233 |                 assert 'original-' not in feat_name
234 |                 assert 'diagnostic' not in feat_name
235 |         except AssertionError:
236 |             raise RuntimeError("Feature {} appears not to be a wavelet feature. The volcano plot color coding only supports wavelet features.".format(feat_name))
237 | 
238 |         df['abbreviated_feature'] = df.feature.str.replace('wavelet-', '')
239 |         df['abbreviated_feature'] = df.abbreviated_feature.str.replace('log-sigma-1-0-mm-3D_', 'LoG_')
240 |         df['abbreviated_feature'] = df.abbreviated_feature.str.replace('log-sigma-3-0-mm-3D_', 'LoG_')
241 | 
242 |         df['abbreviated_feature'] = df['abbreviated_feature'].apply(lambda x: '_'.join([x.split('_')[0], x.split('_')[-1]]))
243 | 
244 |         df['Matrix'] = 'Other'
245 |         for matrix in ['glszm', 'ngtdm', 'glcm', 'glrlm', 'gldm']:  # , 'firstorder'
246 |             mask = df.feature.str.contains(matrix)
247 |             count_ = mask.sum()
248 |             df.loc[mask, 'Matrix'] = matrix
249 |     else:
250 |         df['Feature'] = 'Other'
251 |         for feat_type in ['Tumor_Other', 'Tumor_Lymphocyte', 'Stroma_Other', 'Stroma_Lymphocyte', 'Tumor', 'Necrosis', 'Stroma', 'Fat']:
252 |             mask = (df.feature.str.contains(feat_type) | df.feature.str.contains(feat_type.lower())) & (df['Feature'] == 'Other')
253 |             count_ = mask.sum()
254 |             # df.loc[mask, 'Matrix'] = '{} (n={})'.format(matrix, count_)
255 |             df.loc[mask, 'Feature'] = feat_type.replace('_Other', ' Nuclei').replace('_Lymphocyte', ' Lymphocyte')
256 |         df['abbreviated_feature'] = df['feature'].str.replace('_Other_', ' Nuclei ').str.replace('_Lymphocyte_', ' Lymphocyte ')
257 | 
258 |     df = df.sort_values(by='feature')
259 |     return df
260 | 
261 | def make_volcano_plot(df_, method, output_plot_path, modality, top_k_to_label=1):
262 |     x_axis_name = _get_x_axis_name_volcano(method)
263 | 
264 |     df = _make_results_df_pretty_for_plotting(df_, x_axis_name, modality)
265 |     df.loc[df[x_axis_name] < -3, x_axis_name] = -3
266 |     df.loc[df[x_axis_name] > 3, x_axis_name] = 3
267 | 
268 |     if modality == 'radiology':
269 |         hue_col = 'Matrix'
270 |     else:
271 |         hue_col = 'Feature'
272 | 
273 |     if modality == 'radiology':
274 |         df = df.sort_values(by='-log(p)', ascending=True)
275 | 
276 |     # 
plt.rcParams["axes.labelsize"] = 13 277 | fig = plt.figure(figsize=(3, 2), constrained_layout=True) 278 | g = sns.scatterplot(data=df, 279 | x=x_axis_name, 280 | y='-log(p)', 281 | hue=hue_col, 282 | alpha=0.7, 283 | palette='dark', 284 | hue_order=['glcm', 'gldm', 'glrlm', 'glszm', 'ngtdm'] if modality == 'radiology' else None, 285 | s=6) 286 | plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., title=hue_col) 287 | 288 | df = df.sort_values(by='-log(p)', ascending=False) 289 | ax = plt.gca() 290 | if modality == 'radiology': 291 | sig_threshold = get_ci_95_pval(df) 292 | if sig_threshold != -1: 293 | plt.axhline(y=sig_threshold, color='.2', linewidth=0.5, linestyle='-.') 294 | else: 295 | plt.gca().set_ylim((-0.20185706772068027, 4.2681049500679045)) 296 | plt.gca().spines['right'].set_visible(False) 297 | plt.gca().spines['top'].set_visible(False) 298 | if '.svg' in output_plot_path: 299 | plt.savefig(output_plot_path) 300 | else: 301 | plt.savefig(output_plot_path, dpi=300) 302 | plt.close() 303 | 304 | def get_ci_95_pval(df): 305 | p_vals = df['-log(p)'].apply(lambda x: float(10)**(-x)).tolist() 306 | reject, pvals_corrected, _, _ = multipletests(pvals=p_vals, alpha=0.05, method='fdr_bh', is_sorted=True) 307 | if reject.sum() == 0: 308 | return -1 309 | arg_ = np.argmax(pvals_corrected>=0.05) 310 | return -np.log10((p_vals[arg_ - 1] + p_vals[arg_])/2) 311 | 312 | 313 | if __name__ == '__main__': 314 | PARSER = ArgumentParser(description='select features for survival analysis') 315 | PARSER.add_argument('feature_df_path', type=str, 316 | help='path to pandas DataFrame with features and outcomes. all columns in the DF will be evaluated except "duration", "observed," and any covariates') 317 | PARSER.add_argument('--outcome_df_path', type=str, help='path to pd df with outcomes, optional', default=None) 318 | PARSER.add_argument('--train_id_df_path', type=str, help='path to pd df with train IDs, optional', default=None) 319 | 320 | PARSER.add_argument('--output_df_path', type=str, default='feature_evaluation.csv', help='path at which to save feature evaluation df') 321 | PARSER.add_argument('--output_plot_path', type=str, default='feature_evaluation.png', help='path at which to save feature evaluation volcano plot') 322 | PARSER.add_argument('--index_col', type=str, default='Patient ID', help="name of column to set as index") 323 | PARSER.add_argument('--outlier_std_threshold', type=float, default=5, help="number of standard deviations to use for clipping") 324 | PARSER.add_argument('--method', type=str, default='cph', 325 | help="'kendall' for Kendall's Tau, 'cph' for Cox Proportional Hazards, 'c-index' for Concordance, 'spearman,' 'response-agnostic', or 'pearson'") 326 | PARSER.add_argument('--xy_covar', type=str, default=None, help="XY covariate column name") 327 | PARSER.add_argument('--x_covar', type=str, default=None, help="X covariate column name") 328 | PARSER.add_argument('--y_covar', type=str, default=None, help="Y covariate column name") 329 | PARSER.add_argument('--n_jobs', type=int, default=-1, help="number of parallel jobs to use") 330 | PARSER.add_argument('--modality', type=str, default='radiology', help="radiology or pathology") 331 | ARGS = PARSER.parse_args() 332 | 333 | with open('../global_config.yaml', 'r') as f: 334 | CONFIGS = yaml.safe_load(f) 335 | DATA_DIR = CONFIGS['data_dir'] 336 | CODE_DIR = CONFIGS['code_dir'] 337 | 338 | DF = pd.read_csv(os.path.join(DATA_DIR, ARGS.feature_df_path)).set_index(ARGS.index_col) 339 | 340 | if ARGS.modality == 
'radiology': 341 | DF = DF[[x for x in DF.columns if 'firstorder' not in x]] 342 | 343 | if ARGS.outcome_df_path: 344 | OUTCOME_DF = pd.read_csv(os.path.join(DATA_DIR, ARGS.outcome_df_path)).set_index(ARGS.index_col)[['duration.OS', 'observed.OS']] 345 | OUTCOME_DF = OUTCOME_DF.rename(columns={'duration.OS': 'duration', 'observed.OS': 'observed'}) 346 | DF = DF.join(OUTCOME_DF, how='inner') 347 | 348 | if ARGS.train_id_df_path: 349 | DF = DF[DF.index.isin(pd.read_csv(os.path.join(DATA_DIR, ARGS.train_id_df_path))[ARGS.index_col])] 350 | 351 | try: 352 | assert not DF.isna().sum().any() 353 | except AssertionError: 354 | raise RuntimeError("Input dataframe must not contain any NaN values.") 355 | 356 | FEATURE_NAMES = _get_features_to_consider(DF.columns.tolist(), ARGS) 357 | 358 | DF, _ = preprocess_features(DF, 359 | outlier_threshold=ARGS.outlier_std_threshold, 360 | feature_names=FEATURE_NAMES) 361 | 362 | RESULTS = evaluate_features(df=DF, 363 | feats_to_consider=FEATURE_NAMES, 364 | method=ARGS.method, 365 | covar=ARGS.xy_covar, 366 | x_covar=ARGS.x_covar, 367 | y_covar=ARGS.y_covar, 368 | n_jobs=ARGS.n_jobs) 369 | RESULTS.to_csv(ARGS.output_df_path) 370 | make_volcano_plot(RESULTS, ARGS.method, ARGS.output_plot_path, ARGS.modality) 371 | -------------------------------------------------------------------------------- /feature-selection/select_features.sh: -------------------------------------------------------------------------------- 1 | python select_features.py code/ct-feature-extraction/features/ct_features_omentum.csv --outcome_df_path data/dataframes/clin_df.csv --output_df_path results/hr_ct_features_omentum.csv --output_plot_path results/hr_ct_features_omentum.png --modality radiology --method cph --train_id_df_path data/dataframes/train_ids.csv 2 | 3 | python select_features.py code/ct-feature-extraction/features/ct_features_ovary.csv --outcome_df_path data/dataframes/clin_df.csv --output_df_path results/hr_ct_features_ovary.csv --output_plot_path results/hr_ct_features_ovary.png --modality radiology --method cph --train_id_df_path data/dataframes/train_ids.csv 4 | 5 | python select_features.py code/hne-feature-extraction/tissue_tile_features/reference_hne_features.csv --outcome_df_path data/dataframes/clin_df.csv --output_df_path results/hr_hne_features.csv --output_plot_path results/hr_hne_features.png --modality pathology --method cph --xy_covar n_foreground_tiles --train_id_df_path data/dataframes/train_ids.csv 6 | 7 | -------------------------------------------------------------------------------- /global_config.yaml: -------------------------------------------------------------------------------- 1 | data_dir: 2 | code_dir: 3 | -------------------------------------------------------------------------------- /hne-feature-extraction/0_get_cohort_csv.sh: -------------------------------------------------------------------------------- 1 | python connector.py > ../data/dataframes/hne_df.csv 2 | -------------------------------------------------------------------------------- /hne-feature-extraction/1_infer_tissue_types_and_extract_features.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | source /gpfs/mskmind_ess/limr/mambaforge/etc/profile.d/conda.sh 3 | conda activate transformer 4 | 5 | ARGS="--magnification 20 6 | --cohort_csv_path data/dataframes/hne_df.csv 7 | --preprocessed_cohort_csv_path data/dataframes/preprocessed_hne_df.csv 8 | --tile_dir ../tissue-type-training/pretilings 9 | --tile_selection_type otsu 10 | 
--otsu_threshold 0.5 11 | --purple_threshold 0.05 12 | --batch_size 200 13 | --min_n_tiles 100 14 | --normalize 15 | --gpu 0 1 2 3 16 | --tile_size 128 17 | --model resnet18 18 | --checkpoint_path ../tissue-type-training/checkpoints/tissue_type_classifier_weights.torch" 19 | 20 | python ../tissue-type-training/preprocess.py ${ARGS} 21 | python ../tissue-type-training/pretile.py ${ARGS} 22 | python infer_tissue_tile_clf.py ${ARGS} 23 | python map_inference_to_bitmap.py ${ARGS} 24 | python extract_feats_from_bitmaps.py ${ARGS} 25 | -------------------------------------------------------------------------------- /hne-feature-extraction/1_infer_tissue_types_and_extract_features.sub: -------------------------------------------------------------------------------- 1 | universe = vanilla 2 | executable = 1_infer_tissue_types_and_extract_features.sh 3 | 4 | # requirements to specify the execution machine needs. 5 | #requirements = (CUDACapability >= 4) 6 | 7 | # "short", "medium", "long" for jobs lasting 8 | # ~12 hr, ~24 hr, ~7 days 9 | +GPUJobLength = "short" 10 | 11 | request_gpus = 4 12 | request_memory = 4GB 13 | request_cpus = 64 14 | #request_disk = 10MB 15 | 16 | output = $(Cluster)_$(Process).out 17 | log = $(Cluster)_$(Process).log 18 | error = $(Cluster)_$(Process).err 19 | 20 | # number of jobs to submit 21 | queue 1 22 | 23 | -------------------------------------------------------------------------------- /hne-feature-extraction/2_extract_objects.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | source /gpfs/mskmind_ess/limr/mambaforge/etc/profile.d/conda.sh 3 | conda activate transformer 4 | for entry in "qupath/data/slides"/*.svs; do 5 | temp=`basename ${entry}` 6 | temp="${temp/.svs/.tsv}" 7 | temp="qupath/data/results/${temp}" 8 | echo ${temp} 9 | if test -f ${temp}; then 10 | echo "${temp} exists" 11 | else 12 | echo sh _run_gpu_qupath_stardist_singleSlide.sh `basename $entry` 13 | sh _run_gpu_qupath_stardist_singleSlide.sh `basename $entry` 14 | fi 15 | done 16 | -------------------------------------------------------------------------------- /hne-feature-extraction/2_extract_objects.sub: -------------------------------------------------------------------------------- 1 | universe = vanilla 2 | executable = _run_gpu_qupath_stardist_singleSlide.sh 3 | arguments = $(filename) 4 | 5 | # requirements to specify the execution machine needs. 
6 | #requirements = (CUDACapability >= 4)
7 | 
8 | # "short", "medium", "long" for jobs lasting
9 | # ~12 hr, ~24 hr, ~7 days
10 | +GPUJobLength = "short"
11 | 
12 | request_gpus = 1
13 | request_memory = 36GB
14 | request_cpus = 10
15 | #request_disk = 10MB
16 | 
17 | output = $(Cluster)_$(Process).out
18 | log = $(Cluster)_$(Process).log
19 | error = $(Cluster)_$(Process).err
20 | 
21 | # number of jobs to submit
22 | queue filename from cut -d, -f2 ../data/dataframes/preprocessed_hne_df.csv |
23 | 
-------------------------------------------------------------------------------- /hne-feature-extraction/3_label_objects_and_extract_features.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | source /gpfs/mskmind_ess/limr/mambaforge/etc/profile.d/conda.sh
4 | conda activate transformer
5 | ARGS="
6 | --preprocessed_cohort_csv_path data/dataframes/preprocessed_hne_df.csv
7 | --checkpoint_path ../tissue-type-training/checkpoints/tissue_type_classifier_weights.torch"
8 | 
9 | python merge_cells_and_regions.py ${ARGS}
10 | python extract_feats_from_object_detections.py ${ARGS}
11 | 
-------------------------------------------------------------------------------- /hne-feature-extraction/3_label_objects_and_extract_features.sub: --------------------------------------------------------------------------------
1 | universe = vanilla
2 | executable = 3_label_objects_and_extract_features.sh
3 | 
4 | # requirements to specify the execution machine needs.
5 | #requirements = (CUDACapability >= 4)
6 | 
7 | # "short", "medium", "long" for jobs lasting
8 | # ~12 hr, ~24 hr, ~7 days
9 | +GPUJobLength = "short"
10 | 
11 | #request_gpus = 0
12 | request_memory = 384
13 | request_cpus = 32
14 | #request_disk = 10MB
15 | 
16 | output = $(Cluster)_$(Process).out
17 | log = $(Cluster)_$(Process).log
18 | error = $(Cluster)_$(Process).err
19 | 
20 | # number of jobs to submit
21 | queue 1
22 | 
23 | 
-------------------------------------------------------------------------------- /hne-feature-extraction/_run_gpu_qupath_stardist_singleSlide.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 | source /gpfs/mskmind_ess/limr/mambaforge/etc/profile.d/conda.sh
3 | conda activate transformer
4 | 
5 | singularity run --env TF_FORCE_GPU_ALLOW_GROWTH=true,PER_PROCESS_GPU_MEMORY_FRACTION=0.8 -B $(dirname $1):/data/slides,qupath/data/results:/data/results,qupath/models:/models,qupath/scripts:/scripts --nv qupath/qupath-stardist_latest.sif java -Djava.awt.headless=true \
6 | -Djava.library.path=/qupath-gpu/build/dist/QuPath-0.2.3/lib/app \
7 | -jar /qupath-gpu/build/dist/QuPath-0.2.3/lib/app/qupath-0.2.3.jar \
8 | script --image /data/slides/$(basename $1) /scripts/stardist_nuclei_and_lymphocytes.groovy
9 | 
-------------------------------------------------------------------------------- /hne-feature-extraction/bitmaps/.gitkeep: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/bitmaps/.gitkeep
-------------------------------------------------------------------------------- /hne-feature-extraction/connector.py: --------------------------------------------------------------------------------
1 | """connector.py.
2 | 
3 | Interface for reading and manipulating tables from Dremio.
4 | 
5 | Test with -
6 | $ python breastana/connector.py
7 | """
8 | import urllib.parse
9 | from pathlib import Path
10 | from typing import Any, Callable, Optional
11 | import sys
12 | 
13 | import pandas as pd
14 | from pyarrow import flight
15 | 
16 | 
17 | class TableLoader:
18 |     """Generic table loading interface."""
19 | 
20 |     def __init__(self, user: str, password: str, flight_port: int = 32010):
21 |         self.mod_dict: dict[str, Any] = {}
22 |         self.mod_mask = pd.DataFrame()
23 |         self.user = user
24 |         self.password = password
25 |         self.flight_port = flight_port
26 | 
27 |     def load_from_dremio(self, url: Any) -> pd.DataFrame:
28 |         """Load table from Dremio."""
29 |         dremio_session = DremioDataframeConnector(
30 |             scheme="grpc+tcp",
31 |             hostname=url.hostname,
32 |             flightport=self.flight_port,
33 |             dremio_user=self.user,
34 |             dremio_password=self.password,
35 |             connection_args={},
36 |         )
37 |         return dremio_session.get_table(url.query)
38 | 
39 | 
40 | class FeatureTableLoader(TableLoader):
41 |     """Trying to make loading from dremio nicer."""
42 | 
43 |     def add_table(
44 |         self,
45 |         table: str,
46 |         feature_tag: str,
47 |         index_cols: list[str] = ["main_index"],
48 |         fx_transform: Optional[Callable[[Any], Any]] = None,
49 |         check_uniqueness: bool = True,
50 |     ) -> pd.DataFrame:
51 |         """Add table to FeatureTableLoader object."""
52 |         if not isinstance(table, pd.DataFrame):
53 |             path = Path(table)
54 |             url = urllib.parse.urlparse(table)
55 | 
56 |             print(f"Info: {url}")
57 | 
58 |             if url.scheme == "file" and path.suffix == ".csv":
59 |                 df = pd.read_csv(path)
60 |             if "dremio" in url.scheme:
61 |                 df = self.load_from_dremio(url)
62 |         else:
63 |             df = table.reset_index()
64 | 
65 |         df = df.set_index(index_cols)
66 | 
67 |         if fx_transform is not None:
68 |             df = fx_transform(df)
69 | 
70 |         if not len(df.index.unique()) == len(df.index) and check_uniqueness:
71 |             raise RuntimeError(
72 |                 "Feature tables must have unique indices post-transform",
73 |                 f"N={len(df.index)-len(df.index.unique())} try something else!",
74 |             )
75 | 
76 |         self.mod_dict[feature_tag] = df
77 | 
78 |         return df
79 | 
80 |     def calculate_mask(self) -> pd.DataFrame:
81 |         """Calculate index mask."""
82 |         for key in self.mod_dict.keys():
83 |             df = self[key]
84 |             main_index = df.index.get_level_values(0)
85 | 
86 |             self.mod_mask = self.mod_mask.reindex(self.mod_mask.index.union(main_index))
87 |             self.mod_mask.loc[main_index, key] = True
88 |             self.mod_mask = self.mod_mask.fillna(False)
89 | 
90 |         return self.mod_mask
91 | 
92 |     def __getitem__(self, key: str) -> Any:
93 |         """Get item."""
94 |         return self.mod_dict[key]
95 | 
96 |     def __setitem__(self, key: str, item: Any) -> None:
97 |         """Set item."""
98 |         self.mod_dict[key] = item
99 | 
100 | 
101 | class DremioClientAuthMiddlewareFactory(flight.ClientMiddlewareFactory):  # type: ignore
102 |     """A factory that creates DremioClientAuthMiddleware(s)."""
103 | 
104 |     def __init__(self) -> None:
105 |         self.call_credential: list[Any] = []
106 | 
107 |     def start_call(self, info: Any) -> Any:  # type: ignore
108 |         """Start call."""
109 |         return DremioClientAuthMiddleware(self)
110 | 
111 |     def set_call_credential(self, call_credential: list[Any]) -> None:
112 |         """Set call credentials."""
113 |         self.call_credential = call_credential
114 | 
115 | 
116 | class DremioClientAuthMiddleware(flight.ClientMiddleware):  # type: ignore
117 |     """Dremio ClientMiddleware used for authentication.
118 | 
119 |     Extracts the bearer token from
120 |     the authorization header returned by the Dremio
121 |     Flight Server Endpoint.
122 | 
123 |     Parameters
124 |     ----------
125 |     factory : ClientHeaderAuthMiddlewareFactory
126 |         The factory to set call credentials if an
127 |         authorization header with bearer token is
128 |         returned by the Dremio server.
129 |     """
130 | 
131 |     def __init__(self, factory: DremioClientAuthMiddlewareFactory):
132 |         self.factory = factory
133 | 
134 |     def received_headers(self, headers: dict[str, Any]) -> None:
135 |         """Process header."""
136 |         auth_header_key = "authorization"
137 |         authorization_header: list[Any] = []
138 |         for key in headers:
139 |             if key.lower() == auth_header_key:
140 |                 authorization_header = headers.get(auth_header_key)  # type: ignore
141 |                 self.factory.set_call_credential(
142 |                     [b"authorization", authorization_header[0].encode("utf-8")]
143 |                 )
144 | 
145 | 
146 | class DremioDataframeConnector:
147 |     """Dremio connector.
148 | 
149 |     Interfaces with a Dremio instance/cluster
150 |     via Apache Arrow Flight for fast read performance.
151 | 
152 |     Parameters
153 |     ----------
154 |     scheme: connection scheme
155 |     hostname: host of main dremio name
156 |     flightport: which port dremio exposes to flight requests
157 |     dremio_user: username to use
158 |     dremio_password: associated password
159 |     connection_args: anything else to pass to the FlightClient initialization
160 |     """
161 | 
162 |     def __init__(
163 |         self,
164 |         scheme: str,
165 |         hostname: str,
166 |         flightport: int,
167 |         dremio_user: str,
168 |         dremio_password: str,
169 |         connection_args: dict[str, Any],
170 |     ):
171 |         # Skipping tls...
172 | 
173 |         # Two WLM settings can be provided upon initial authentication
174 |         # with the Dremio Server Flight Endpoint:
175 |         # - routing-tag
176 |         # - routing queue
177 |         initial_options = flight.FlightCallOptions(
178 |             headers=[
179 |                 (b"routing-tag", b"test-routing-tag"),
180 |                 (b"routing-queue", b"Low Cost User Queries"),
181 |             ]
182 |         )
183 |         client_auth_middleware = DremioClientAuthMiddlewareFactory()
184 |         client = flight.FlightClient(
185 |             f"{scheme}://{hostname}:{flightport}",
186 |             middleware=[client_auth_middleware],
187 |             **connection_args,
188 |         )
189 |         self.bearer_token = client.authenticate_basic_token(
190 |             dremio_user, dremio_password, initial_options
191 |         )
192 |         self.client = client
193 | 
194 |     def run(self, project: str, table_name: str) -> pd.DataFrame:
195 |         """Get a fixed table.
196 | 
197 |         Returns the virtual table at project (or "space").table_name
198 |         as a pandas dataframe.
199 | 
200 |         Parameters
201 |         ----------
202 |         project: Project ID to read from
203 |         table_name: Table name to load
204 | 
205 |         """
206 |         sqlquery = f'''SELECT * FROM "{project}"."{table_name}"'''
207 | 
208 |         # flight_desc = flight.FlightDescriptor.for_command(sqlquery)
209 |         print("[INFO] Query: ", sqlquery)
210 | 
211 |         options = flight.FlightCallOptions(headers=[self.bearer_token])
212 |         # schema = self.client.get_schema(flight_desc, options)
213 | 
214 |         # Get the FlightInfo message to retrieve the Ticket corresponding
215 |         # to the query result set.
216 |         flight_info = self.client.get_flight_info(
217 |             flight.FlightDescriptor.for_command(sqlquery), options
218 |         )
219 | 
220 |         # Retrieve the result set as a stream of Arrow record batches.
221 |         reader = self.client.do_get(flight_info.endpoints[0].ticket, options)
222 |         return reader.read_pandas()
223 | 
224 |     def get_table(self, sqlquery: str) -> pd.DataFrame:
225 |         """Run a query.
226 | 
227 |         Returns the result set of `sqlquery`
228 |         as a pandas dataframe.
229 | 
230 |         Parameters
231 |         ----------
232 |         sqlquery: SQL query to execute against Dremio
233 |             (e.g., SELECT * FROM "space"."table")
234 | 
235 |         """
236 |         # flight_desc = flight.FlightDescriptor.for_command(sqlquery)
237 |         print("[INFO] Query: ", sqlquery)
238 | 
239 |         options = flight.FlightCallOptions(headers=[self.bearer_token])
240 |         # schema = self.client.get_schema(flight_desc, options)
241 | 
242 |         # Get the FlightInfo message to retrieve the Ticket corresponding
243 |         # to the query result set.
244 |         flight_info = self.client.get_flight_info(
245 |             flight.FlightDescriptor.for_command(sqlquery), options
246 |         )
247 | 
248 |         # Retrieve the result set as a stream of Arrow record batches.
249 |         reader = self.client.do_get(flight_info.endpoints[0].ticket, options)
250 | 
251 |         return reader.read_pandas()
252 | 
253 | 
254 | if __name__ == "__main__":
255 |     import getpass
256 | 
257 |     # set username and password
258 |     # (or Personal Access Token) when prompted at the command prompt
259 |     DREMIO_USER = input("Username: ")
260 |     DREMIO_PASSWORD = getpass.getpass(prompt="Password or PAT: ", stream=None)
261 | 
262 |     dremio_session = DremioDataframeConnector(
263 |         scheme="grpc+tcp",
264 |         hostname="tlvidreamcord1",
265 |         flightport=32010,
266 |         dremio_user=DREMIO_USER,
267 |         dremio_password=DREMIO_PASSWORD,
268 |         connection_args={},
269 |     )
270 |     query = 'SELECT merged_hne_inventory.spectrum_sample_id, merged_hne_inventory.slide_image FROM merged_hne_inventory'
271 | 
272 |     df = dremio_session.get_table(query)
273 |     df['merged_hne_inventory.slide_image'] = df['merged_hne_inventory.slide_image'].str.removeprefix("file://")
274 |     df.to_csv(sys.stdout)
275 | 
-------------------------------------------------------------------------------- /hne-feature-extraction/environment.yml: --------------------------------------------------------------------------------
../tissue-type-training/environment.yml
-------------------------------------------------------------------------------- /hne-feature-extraction/extract_feats_from_bitmaps.py: --------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import os
4 | from joblib import Parallel, delayed
5 | from PIL import Image
6 | from skimage import measure
7 | import sys
8 | sys.path.append('../tissue-type-training')
9 | import config
10 | 
11 | 
12 | def extract_fraction_necrosis(result_dict, class_bitmaps):
13 |     necrotic_pixels = np.sum(class_bitmaps['Necrosis'])
14 |     total_foreground_pixels = 0
15 |     for class_, bitmap in class_bitmaps.items():
16 |         total_foreground_pixels += np.sum(bitmap)
17 |     feature = float(necrotic_pixels) / total_foreground_pixels
18 |     result_dict['fraction_area_necrotic'] = feature
19 | 
20 | 
21 | def extract_ratio_necrosis_to_tumor(result_dict, class_bitmaps):
22 |     necrotic_pixels = np.sum(class_bitmaps['Necrosis'])
23 |     tumor_pixels = np.sum(class_bitmaps['Tumor'])
24 |     if tumor_pixels > 0:
25 |         feature = float(necrotic_pixels) / tumor_pixels
26 |         result_dict['ratio_necrosis_to_tumor'] = feature
27 |     else:
28 |         pass
29 | 
30 | 
31 | def extract_ratio_necrosis_to_stroma(result_dict, class_bitmaps):
32 |     necrotic_pixels = np.sum(class_bitmaps['Necrosis'])
33 |     stroma_pixels = np.sum(class_bitmaps['Stroma'])
34 |     if stroma_pixels > 0:
35 |         feature = float(necrotic_pixels) / stroma_pixels
36 |         result_dict['ratio_necrosis_to_stroma'] = feature
37 |     else:
38 |         pass
39 | 
40 | 
41 | def extract_shannon_entropy(result_dict, class_bitmaps, prefix=None):
42 |     n = 0
43 |     p = []
44 |     for bitmap in class_bitmaps.values():
45 |         count = np.sum(bitmap != 0)
46 |         n += count
47 |         p.append(count)
48 |     p = np.array(p)
49 |     p = p/n
50 | 
51 |     shannon_entropy = 0
52 |     for prob in p:
53 |         if prob > 0:
54 |             shannon_entropy -= prob * np.log2(prob)
55 | 
56 |     if prefix:
57 |         key_name = '{}_shannon_entropy'.format(prefix)
58 |     else:
59 |         key_name = 'shannon_entropy'
60 | 
61 |     result_dict[key_name] = shannon_entropy
62 | 
63 | 
64 | def extract_tumor_stroma_entropy(result_dict, class_bitmaps_):
65 |     class_bitmaps = {'Tumor': class_bitmaps_['Tumor'],
66 |                      'Stroma': class_bitmaps_['Stroma']}
67 |     return extract_shannon_entropy(result_dict, class_bitmaps, prefix='Tumor_Stroma')
68 | 
69 | 
70 | def get_classwise_regionprops(result_dict, class_bitmaps):
71 |     for class_, bitmap in class_bitmaps.items():
72 |         features = _get_single_class_regionprops(bitmap, class_)
73 |         result_dict.update(features)
74 | 
75 |         # get regionprops for largest connected component
76 |         largest_cc_map = _get_single_class_largest_cc_bitmap(bitmap)
77 |         if largest_cc_map is not None:
78 |             features = _get_single_class_regionprops(largest_cc_map, '_'.join([class_,
79 |                                                                                'largest_component']))
80 |             result_dict.update(features)
81 | 
82 | 
83 | def _get_single_class_regionprops(class_bitmap, class_label):
84 |     features = {}
85 |     properties = measure.regionprops(class_bitmap)
86 |     if properties:
87 |         properties = properties[0]
88 |     else:
89 |         return {'_'.join([class_label, 'area']): 0}
90 | 
91 |     features['_'.join([class_label, 'area'])] = properties.area
92 |     features['_'.join([class_label, 'convex_area'])] = properties.convex_area
93 |     features['_'.join([class_label, 'eccentricity'])] = properties.eccentricity
94 |     features['_'.join([class_label, 'equivalent_diameter'])] = properties.equivalent_diameter
95 |     features['_'.join([class_label, 'euler_number'])] = properties.euler_number
96 |     features['_'.join([class_label, 'extent'])] = properties.extent
97 |     # features['_'.join([class_label, 'feret_diameter_max'])] = properties.feret_diameter_max
98 |     # features['_'.join([class_label, 'filled_area'])] = properties.filled_area
99 |     features['_'.join([class_label, 'major_axis_length'])] = properties.major_axis_length
100 |     features['_'.join([class_label, 'minor_axis_length'])] = properties.minor_axis_length
101 |     features['_'.join([class_label, 'perimeter'])] = properties.perimeter
102 |     # features['_'.join([class_label, 'perimeter_crofton'])] = properties.perimeter_crofton
103 |     features['_'.join([class_label, 'solidity'])] = properties.solidity
104 |     features['_'.join([class_label, 'PA_ratio'])] = properties.perimeter / float(properties.area)
105 |     return features
106 | 
107 | 
108 | 
109 | 
110 | def _get_single_class_largest_cc_bitmap(bitmap):
111 |     labels, n = measure.label(bitmap, return_num=True)
112 | 
113 |     largest_area = 0
114 |     associated_label = -1
115 |     for label in range(1, n + 1):  # connected-component labels run 1..n inclusive
116 |         area = np.sum(labels == label)
117 |         if area > largest_area:
118 |             largest_area = area
119 |             associated_label = label
120 |     if associated_label == -1:
121 |         return None
122 |     else:
123 |         return (labels == associated_label).astype(int)
124 | 
125 | 
126 | def extract_feats(dir_, slide_name, class_list):
127 |     class_bitmaps = dict()
128 |     for class_ in class_list:
129 |         class_bitmaps[class_] = np.array(Image.open(os.path.join(dir_, slide_name, class_ + '.png'))).squeeze()
130 | 
131 |     result_ = dict()
132 |     get_classwise_regionprops(result_, class_bitmaps)
133 |     extract_fraction_necrosis(result_, class_bitmaps)
134 |     extract_ratio_necrosis_to_tumor(result_, class_bitmaps)
135 |     extract_ratio_necrosis_to_stroma(result_, class_bitmaps)
136 |     extract_shannon_entropy(result_, class_bitmaps)
137 |     extract_tumor_stroma_entropy(result_, class_bitmaps)
138 |     return {slide_name: result_}
139 | 
140 | 
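# --- Worked example (added note; toy numbers) ---------------------------------
# extract_shannon_entropy computes H = -sum_i p_i * log2(p_i), where p_i is the
# fraction of foreground pixels belonging to tissue class i. For class areas of
# 50%/25%/25%:
#   H = -(0.5*log2(0.5) + 0.25*log2(0.25) + 0.25*log2(0.25))
#     = 0.5 + 0.5 + 0.5 = 1.5 bits
# A slide dominated by a single class approaches H = 0; four equally abundant
# classes give the maximum H = 2 bits.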
extract_ratio_necrosis_to_tumor(result_, class_bitmaps) 135 | extract_ratio_necrosis_to_stroma(result_, class_bitmaps) 136 | extract_shannon_entropy(result_, class_bitmaps) 137 | extract_tumor_stroma_entropy(result_, class_bitmaps) 138 | return {slide_name: result_} 139 | 140 | 141 | if __name__ == '__main__': 142 | checkpoint_name = config.args.checkpoint_path.split('/')[-1].replace('.torch', '') 143 | bitmap_dir = 'bitmaps/{}'.format(checkpoint_name) 144 | feat_df_filename = 'tissue_tile_features/{}.csv'.format(checkpoint_name) 145 | SERIAL = False 146 | 147 | slide_list = os.listdir(bitmap_dir) 148 | 149 | map_key = {'Stroma': 0, 150 | 'Tumor': 1, 151 | 'Fat': 2, 152 | 'Necrosis': 3} 153 | classes = list(map_key.keys()) 154 | 155 | results = {} 156 | if SERIAL: 157 | for slide in slide_list: 158 | print(slide) 159 | result = extract_feats(bitmap_dir, slide, classes) 160 | results.update(result) 161 | else: 162 | dicts = Parallel(n_jobs=32)(delayed(extract_feats)(bitmap_dir, slide, classes) for slide in slide_list) 163 | for dict_ in dicts: 164 | results.update(dict_) 165 | 166 | df = pd.DataFrame(results).T 167 | print(df) 168 | 169 | df = df.reset_index().rename(columns={'index': 'image_id'}) 170 | df.to_csv(feat_df_filename, index=False) 171 | 172 | -------------------------------------------------------------------------------- /hne-feature-extraction/extract_feats_from_object_detections.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import os 4 | import yaml 5 | from joblib import Parallel, delayed 6 | import sys 7 | sys.path.append('../tissue-type-training') 8 | import config 9 | 10 | 11 | def get_density(obj_feats, regional_feats, result_dict, parent, class_): 12 | parent_area = regional_feats['{}_area'.format(parent)].item() * 64 # scale factor for 1/16 downsampling and 0.5 µm / pixel 13 | # print(parent_area) 14 | obj_mask = (obj_feats.Parent == parent) & (obj_feats.Class == class_) 15 | # print(obj_mask) 16 | obj_count = obj_mask.sum() 17 | # print(obj_count) 18 | key_ = '{}_{}_density'.format(parent, class_) 19 | try: 20 | result_dict[key_] = float(obj_count) / parent_area 21 | except ZeroDivisionError: 22 | result_dict[key_] = np.nan 23 | 24 | 25 | def get_quantiles(object_feats, mask, feat, output_feat_name, results_dict): 26 | results_dict[output_feat_name.format('mean')] = object_feats.loc[mask, feat].mean() 27 | for quantile in np.arange(0.1, 1, 0.1): 28 | results_dict[output_feat_name.format('quantile{:2.1f}'.format(quantile))] = object_feats.loc[mask, feat].quantile(quantile) 29 | results_dict[output_feat_name.format('var')] = object_feats.loc[mask, feat].var() 30 | results_dict[output_feat_name.format('skew')] = object_feats.loc[mask, feat].skew() 31 | results_dict[output_feat_name.format('kurtosis')] = object_feats.loc[mask, feat].kurtosis() 32 | 33 | 34 | def extract_feats(object_feat_fn, regional_feature_df_, slide_id): 35 | regional_feature_df = regional_feature_df_[regional_feature_df_.index.astype(str) == str(slide_id)] 36 | if len(regional_feature_df) != 1: 37 | return {} 38 | 39 | result_ = {} 40 | object_feats = pd.read_csv(object_feat_fn) 41 | if len(object_feats.columns) == 1: 42 | object_feats = pd.read_csv(object_feat_fn, delimiter='\t') 43 | object_feats = object_feats[object_feats['Detection probability'] > DETECTION_PROB_THRESHOLD] 44 | # print(object_feats.Parent.value_counts()) 45 | get_density(object_feats, 46 | regional_feature_df, 47 | result_, 48 | 
parent='Tumor', 49 | class_='Lymphocyte' 50 | ) 51 | get_density(object_feats, 52 | regional_feature_df, 53 | result_, 54 | parent='Tumor', 55 | class_='Other' 56 | ) 57 | get_density(object_feats, 58 | regional_feature_df, 59 | result_, 60 | parent='Necrosis', 61 | class_='Other' 62 | ) 63 | get_density(object_feats, 64 | regional_feature_df, 65 | result_, 66 | parent='Stroma', 67 | class_='Lymphocyte' 68 | ) 69 | get_density(object_feats, 70 | regional_feature_df, 71 | result_, 72 | parent='Stroma', 73 | class_='Other' 74 | ) 75 | get_quantiles(object_feats=object_feats, 76 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 77 | feat='Nucleus: Area µm^2', 78 | output_feat_name='Tumor_Other_{}_nuclear_area', 79 | results_dict=result_) 80 | 81 | get_quantiles(object_feats=object_feats, 82 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 83 | feat='Nucleus: Circularity', 84 | output_feat_name='Tumor_Other_{}_nuclear_circularity', 85 | results_dict=result_) 86 | 87 | get_quantiles(object_feats=object_feats, 88 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 89 | feat='Nucleus: Solidity', 90 | output_feat_name='Tumor_Other_{}_nuclear_solidity', 91 | results_dict=result_) 92 | 93 | get_quantiles(object_feats=object_feats, 94 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 95 | feat='Nucleus: Max diameter µm', 96 | output_feat_name='Tumor_Other_{}_nuclear_max_diameter', 97 | results_dict=result_) 98 | 99 | get_quantiles(object_feats=object_feats, 100 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 101 | feat='Hematoxylin: Nucleus: Mean', 102 | output_feat_name='Tumor_Other_{}_nuclear_hematoxylin_mean', 103 | results_dict=result_) 104 | 105 | get_quantiles(object_feats=object_feats, 106 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 107 | feat='Hematoxylin: Nucleus: Median', 108 | output_feat_name='Tumor_Other_{}_nuclear_hematoxylin_median', 109 | results_dict=result_) 110 | 111 | get_quantiles(object_feats=object_feats, 112 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 113 | feat='Hematoxylin: Nucleus: Min', 114 | output_feat_name='Tumor_Other_{}_nuclear_hematoxylin_min', 115 | results_dict=result_) 116 | 117 | get_quantiles(object_feats=object_feats, 118 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 119 | feat='Hematoxylin: Nucleus: Max', 120 | output_feat_name='Tumor_Other_{}_nuclear_hematoxylin_max', 121 | results_dict=result_) 122 | 123 | get_quantiles(object_feats=object_feats, 124 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 125 | feat='Hematoxylin: Nucleus: Std.Dev.', 126 | output_feat_name='Tumor_Other_{}_nuclear_hematoxylin_stdDev', 127 | results_dict=result_) 128 | 129 | get_quantiles(object_feats=object_feats, 130 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 131 | feat='Eosin: Nucleus: Mean', 132 | output_feat_name='Tumor_Other_{}_nuclear_eosin_mean', 133 | results_dict=result_) 134 | 135 | get_quantiles(object_feats=object_feats, 136 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 137 | feat='Eosin: Nucleus: Median', 138 | output_feat_name='Tumor_Other_{}_nuclear_eosin_median', 139 | results_dict=result_) 140 | 141 | get_quantiles(object_feats=object_feats, 142 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 143 | feat='Eosin: Nucleus: Min', 144 | 
output_feat_name='Tumor_Other_{}_nuclear_eosin_min', 145 | results_dict=result_) 146 | 147 | get_quantiles(object_feats=object_feats, 148 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 149 | feat='Eosin: Nucleus: Max', 150 | output_feat_name='Tumor_Other_{}_nuclear_eosin_max', 151 | results_dict=result_) 152 | 153 | get_quantiles(object_feats=object_feats, 154 | mask=(object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other'), 155 | feat='Eosin: Nucleus: Std.Dev.', 156 | output_feat_name='Tumor_Other_{}_nuclear_eosin_stdDev', 157 | results_dict=result_) 158 | 159 | try: 160 | result_['ratio_Tumor_Lymphocyte_to_Tumor_Other'] = float( 161 | ((object_feats.Parent == 'Tumor') & (object_feats.Class == 'Lymphocyte')).sum()) / float( 162 | ((object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other')).sum() 163 | ) 164 | except ZeroDivisionError: 165 | result_['ratio_Tumor_Lymphocyte_to_Tumor_Other'] = np.nan 166 | 167 | return {int(slide_id): result_} 168 | 169 | 170 | if __name__ == '__main__': 171 | 172 | checkpoint_name = config.args.checkpoint_path.split('/')[-1].replace('.torch', '') 173 | 174 | regional_feat_df_filename = 'tissue_tile_features/{}.csv'.format(checkpoint_name) 175 | 176 | object_detection_dir = 'final_objects/{}'.format(checkpoint_name) 177 | 178 | merged_feat_df_filename = 'tissue_tile_features/{}_merged.csv'.format(checkpoint_name) 179 | SERIAL = True 180 | DETECTION_PROB_THRESHOLD = 0.5 181 | 182 | slide_list = [x for x in os.listdir(object_detection_dir) if (('.csv' in x) or ('.tsv' in x))] 183 | regional_feat_df = pd.read_csv(regional_feat_df_filename).set_index('image_id') 184 | 185 | results = {} 186 | if SERIAL: 187 | for slide in slide_list: 188 | print(slide) 189 | result = extract_feats(os.path.join(object_detection_dir, slide), regional_feat_df, slide[:-4]) 190 | results.update(result) 191 | else: 192 | dicts = Parallel(n_jobs=32)(delayed(extract_feats)(os.path.join(object_detection_dir, slide), regional_feat_df, slide[:-4]) for slide in slide_list) 193 | for dict_ in dicts: 194 | results.update(dict_) 195 | 196 | df = pd.DataFrame(results).T 197 | df.index = df.index.astype('str') 198 | df = df.join(regional_feat_df, how='inner') 199 | print(df) 200 | 201 | df = df.reset_index().rename(columns={'index': 'image_id'}) 202 | df['image_id'] = df['image_id'].astype(str) 203 | 204 | hne_df = pd.read_csv(config.args.preprocessed_cohort_csv_path) 205 | hne_df['image_id'] = hne_df['image_path'].apply(lambda x: x.split('/')[-1][:-4]).astype(str) 206 | df = df.join(hne_df[['image_id', 'Patient ID', 'n_foreground_tiles']].set_index('image_id'), on='image_id', how='left') 207 | df = df.drop(columns=['image_id']) 208 | df['Patient ID'] = df['Patient ID'].astype(str).apply(lambda x: x.zfill(3)) 209 | df = df.fillna(df.median()) 210 | df.to_csv(merged_feat_df_filename, index=False) 211 | 212 | -------------------------------------------------------------------------------- /hne-feature-extraction/final_objects/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/final_objects/.gitkeep -------------------------------------------------------------------------------- /hne-feature-extraction/hne-feature-extraction.dag: -------------------------------------------------------------------------------- 1 | JOB FIRST 1_infer_tissue_types_and_extract_features.sub 2 | JOB SECOND 
2_extract_objects.sub 3 | JOB THIRD 3_label_objects_and_extract_features.sub 4 | PARENT FIRST CHILD SECOND 5 | PARENT SECOND CHILD THIRD 6 | -------------------------------------------------------------------------------- /hne-feature-extraction/infer_tissue_tile_clf.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import torch.nn as nn 3 | import numpy as np 4 | import torch 5 | import csv 6 | import os 7 | from torch.utils.data import DataLoader 8 | import torch.multiprocessing as mp 9 | 10 | import sys 11 | sys.path.append('../tissue-type-training') 12 | from general_utils import setup, get_val_transforms 13 | import config 14 | from models import load_tissue_tile_net 15 | from train_tissue_tile_clf import prep_df, make_preds 16 | from dataset import TissueTileDataset 17 | 18 | 19 | def make_preds_by_slide(model, df, device, file_dir, n_classes): 20 | header = ['tile_file_name', 'label'] 21 | header.extend(['score_{}'.format(k) for k in range(n_classes)]) 22 | 23 | for image_path, sub_df in df.groupby('image_path'): 24 | dataset = TissueTileDataset(df=sub_df, 25 | tile_dir=config.args.tile_dir, 26 | transforms=get_val_transforms()) 27 | loader = DataLoader(dataset, 28 | batch_size=config.args.batch_size, 29 | num_workers=4) 30 | 31 | file_name = os.path.join(file_dir, image_path.split('/')[-1][:-4] + '.csv') 32 | 33 | with open(file_name, 'w', newline='') as file: 34 | writer = csv.writer(file, delimiter=',') 35 | writer.writerow(header) 36 | 37 | with torch.no_grad(): 38 | for ids, tiles, labels in loader: 39 | preds = model(tiles.to(device)) 40 | preds = preds.detach().cpu().tolist() 41 | for idx, label, pred_list in zip(ids, labels.tolist(), preds): 42 | row = [idx, label] 43 | row.extend(pred_list) 44 | writer.writerow(row) 45 | 46 | 47 | def get_fold_slides(df, world_size, rank): 48 | all_slides = df.image_path.unique() 49 | chunks = np.array_split(all_slides, world_size) 50 | return chunks[rank] 51 | 52 | 53 | def distribute(rank, world_size, df_, n_classes, val_dir): 54 | setup(rank, world_size) 55 | device_ids = [config.args.gpu[rank]] 56 | device = torch.device('cuda:{}'.format(device_ids[0])) 57 | print('distributed to device {}'.format(str(device))) 58 | 59 | df = df_[df_.image_path.isin(get_fold_slides(df_, world_size, rank))] 60 | 61 | model = load_tissue_tile_net(config.args.checkpoint_path, activation=nn.Softmax(dim=1), n_classes=n_classes) 62 | model.to(device) 63 | model.eval() 64 | 65 | make_preds_by_slide(model, 66 | df, 67 | device, 68 | val_dir, 69 | n_classes) 70 | 71 | 72 | def serialize(df, n_classes, val_dir): 73 | model = load_tissue_tile_net(config.args.checkpoint_path, activation=nn.Softmax(dim=1), n_classes=n_classes) # dim=1 matches distribute() and avoids the deprecated implicit-dim Softmax 74 | 75 | device = torch.device('cuda:{}'.format(config.args.gpu[0])) 76 | model.to(device) 77 | model.eval() 78 | 79 | make_preds_by_slide(model, 80 | df, 81 | device, 82 | val_dir, 83 | n_classes) 84 | 85 | 86 | if __name__ == '__main__': 87 | assert config.args.checkpoint_path 88 | 89 | checkpoint_name = config.args.checkpoint_path.split('/')[-1].replace('.torch', '') 90 | inference_dir = 'inference/{}'.format(checkpoint_name) 91 | if not os.path.exists(inference_dir): 92 | os.mkdir(inference_dir) 93 | 94 | world_size_ = len(config.args.gpu) 95 | df, n_classes, _, _ = prep_df(config.args.preprocessed_cohort_csv_path, tile_dir=config.args.tile_dir, map_classes=False) 96 | 97 | if world_size_ == 1: 98 | serialize(df, n_classes, inference_dir) 99 | else: 100 | mp.spawn(distribute, 101 |
args=(world_size_, df, n_classes, inference_dir), 102 | nprocs=world_size_, 103 | join=True) 104 | -------------------------------------------------------------------------------- /hne-feature-extraction/inference/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/inference/.gitkeep -------------------------------------------------------------------------------- /hne-feature-extraction/map_inference_to_bitmap.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import os 4 | import yaml 5 | from joblib import Parallel, delayed 6 | from openslide import OpenSlide 7 | from openslide.lowlevel import OpenSlideUnsupportedFormatError 8 | from skimage.draw import rectangle_perimeter, rectangle 9 | from PIL import Image 10 | import sys 11 | sys.path.append('../tissue-type-training') 12 | import general_utils 13 | import config 14 | 15 | def convert_to_bitmap(slide_path, bitmap_dir, inference_dir, scale, map_key): 16 | slide_id = slide_path.split('/')[-1][:-4] 17 | slide_bitmap_subdir = os.path.join(bitmap_dir, slide_id) 18 | if not os.path.exists(slide_bitmap_subdir): 19 | os.mkdir(slide_bitmap_subdir) 20 | map_reverse_key = dict([(v, k) for k, v in map_key.items()]) 21 | 22 | # load thumbnail 23 | slide = OpenSlide(slide_path) 24 | 25 | slide_mag = general_utils.get_magnification(slide) 26 | scale = general_utils.adjust_scale_for_slide_mag(slide_mag=slide_mag, 27 | desired_mag=config.args.magnification, 28 | scale=scale) 29 | thumbnail = general_utils.get_downscaled_thumbnail(slide, scale) 30 | overlap = int(config.args.overlap // scale) 31 | 32 | # create bitmaps 33 | bitmaps = {} 34 | for key, val in map_key.items(): 35 | bitmaps[key] = np.zeros(thumbnail.shape[:2], dtype=np.uint8) 36 | 37 | # load tile class inference csv, create tile_address and predicted_class column 38 | df = pd.read_csv(os.path.join(inference_dir, slide_id + '.csv')) 39 | df['predicted_class'] = df.drop(columns=['label', 'tile_file_name']).idxmax(axis='columns').str.replace( 40 | 'score_', '').astype(int) 41 | df['address'] = df['tile_file_name'].apply( 42 | lambda x: [int(y) for y in x.replace('.png', '').split('/')[1].split('_')]) 43 | # for each tile, populate the associated area in the bitmap with pred_class 44 | 45 | generator, generator_level = general_utils.get_full_resolution_generator( 46 | general_utils.array_to_slide(thumbnail), 47 | tile_size=config.desired_otsu_thumbnail_tile_size, 48 | overlap=overlap) 49 | 50 | for address, class_number in zip(df.address, df.predicted_class): 51 | extent = generator.get_tile_dimensions(generator_level, address) 52 | start = (address[1] * config.desired_otsu_thumbnail_tile_size, 53 | address[0] * config.desired_otsu_thumbnail_tile_size) 54 | 55 | class_label = map_reverse_key[class_number] 56 | _thumbnail = bitmaps[class_label] 57 | rr, cc = rectangle(start=start, extent=extent, shape=_thumbnail.shape) 58 | _thumbnail[rr, cc] = 255 59 | 60 | # save bitmaps 61 | for class_label, bitmap in bitmaps.items(): 62 | _thumbnail = Image.fromarray(bitmap) 63 | _thumbnail.save(os.path.join(slide_bitmap_subdir, class_label + '.png')) 64 | 65 | # generate and save overlay 66 | vals = list(map_key.values()) 67 | range_ = [np.min(vals), np.max(vals)] 68 | thumbnail = general_utils.visualize_tile_scoring(thumbnail, 69 | 
config.desired_otsu_thumbnail_tile_size, 70 | df.address.tolist(), 71 | df.predicted_class.tolist(), 72 | overlap=overlap, 73 | range_=range_) 74 | thumbnail = Image.fromarray(thumbnail) 75 | thumbnail = general_utils.label_image_tissue_type(thumbnail, map_key) 76 | thumbnail.save(os.path.join(slide_bitmap_subdir, '_overlay.png')) 77 | 78 | 79 | if __name__ == '__main__': 80 | checkpoint_name = config.args.checkpoint_path.split('/')[-1].replace('.torch', '') 81 | inference_dir = 'inference/{}'.format(checkpoint_name) 82 | bitmap_dir = 'bitmaps/{}'.format(checkpoint_name) 83 | if not os.path.exists(bitmap_dir): 84 | os.mkdir(bitmap_dir) 85 | 86 | df = pd.read_csv(config.args.preprocessed_cohort_csv_path) 87 | with open('../global_config.yaml', 'r') as f: 88 | DIRECTORIES = yaml.safe_load(f) 89 | DATA_DIR = DIRECTORIES['data_dir'] 90 | df['image_path'] = df['image_path'].apply(lambda x: os.path.join(DATA_DIR, x)) 91 | 92 | scale_factor = config.args.tile_size / config.desired_otsu_thumbnail_tile_size 93 | 94 | map_key = {'Stroma': 0, 95 | 'Tumor': 1, 96 | 'Fat': 2, 97 | 'Necrosis': 3} 98 | 99 | for slide_path in df['image_path']: 100 | print(slide_path) 101 | convert_to_bitmap(slide_path, bitmap_dir, inference_dir, scale_factor, map_key) 102 | 103 | # Parallel(n_jobs=32)(delayed(convert_to_bitmap)(slide, bitmap_dir, inference_dir, scale_factor, map_key) for slide in slide_list) 104 | -------------------------------------------------------------------------------- /hne-feature-extraction/merge_cells_and_regions.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import os 3 | from PIL import Image 4 | import numpy as np 5 | from joblib import Parallel, delayed 6 | import sys 7 | sys.path.append('../tissue-type-training') 8 | import config 9 | 10 | 11 | def visualize_cell_detections(df_, size, fn): 12 | marker_size = 2 13 | max_cells = 1000 14 | arr = np.zeros(size).astype(np.uint8) 15 | arr[:,:] = 0 16 | if len(df_) > max_cells: 17 | df = df_.sample(max_cells) 18 | else: 19 | df = df_.copy() 20 | for index, row in df.iterrows(): 21 | arr[(row.CentroidY-marker_size):(row.CentroidY+marker_size), (row.CentroidX-marker_size):(row.CentroidX+marker_size)] = 255 22 | img = Image.fromarray(arr) 23 | img.save(fn) 24 | 25 | # now visualize by class 26 | map_ = {'Tumor': np.array([0, 255, 0]), 27 | 'Stroma': np.array([0, 191, 255]), 28 | 'Necrosis': np.array([255, 0, 0]), 29 | 'Fat': np.array([255, 255, 0]), 30 | 'Unknown': np.array([255, 255, 255])} 31 | arr = np.zeros((size[0], size[1], 3)).astype(np.uint8) 32 | for index, row in df.iterrows(): 33 | if row.Parent == 'Unknown': 34 | continue 35 | arr[(row.CentroidY-marker_size):(row.CentroidY+marker_size), (row.CentroidX-marker_size):(row.CentroidX+marker_size)] = map_[row.Parent] 36 | img = Image.fromarray(arr) 37 | img.save(fn.replace('.png', '_classified.png')) 38 | 39 | 40 | def load_tissue_maps(_dir, _classes): 41 | bbox_map = None 42 | d = {} 43 | for class_ in _classes: 44 | img_fn = os.path.join(_dir, class_ + '.png') 45 | if not os.path.exists(img_fn): 46 | return None, None 47 | img = Image.open(img_fn) 48 | img = np.array(img) 49 | img = (img != 0).astype(np.uint8) 50 | if bbox_map is None: 51 | bbox_map = img.astype(bool) 52 | else: 53 | bbox_map = bbox_map.astype(bool) | img.astype(bool) 54 | d[class_] = img 55 | img_size = img.shape 56 | #print(d[class_].shape) 57 | #print(np.argmin(bbox_map, axis=0).min()) 58 | #print(np.argmax(bbox_map, axis=0).max()) 59 | 
#print(np.argmin(bbox_map, axis=1).min()) 60 | #print(np.argmax(bbox_map, axis=1).max()) 61 | return d, img_size 62 | 63 | 64 | def load_cell_detections(fn, scale_factor, shape): 65 | object_detections = pd.read_csv(fn, delimiter='\t') 66 | #print(object_detections) 67 | object_detections = object_detections.rename(columns={list(object_detections.columns)[5]: 'CentroidX', list(object_detections.columns)[6]: 'CentroidY'}) 68 | object_detections['CentroidX'] = (object_detections['CentroidX'] // scale_factor).astype(int) 69 | object_detections['CentroidY'] = (object_detections['CentroidY'] // scale_factor).astype(int) 70 | #print(object_detections['CentroidX'].describe()) 71 | #print(object_detections['CentroidY'].describe()) 72 | object_detections.loc[object_detections.CentroidX >= shape[1], 'CentroidX'] = shape[1] - 1 73 | object_detections.loc[object_detections.CentroidY >= shape[0], 'CentroidY'] = shape[0] - 1 74 | object_detections['Parent'] = 'Unknown' 75 | return object_detections 76 | 77 | 78 | def assign_cell_parents(object_detections, tissue_regional_maps): 79 | for class_, tissue_map in tissue_regional_maps.items(): 80 | #print(class_) 81 | object_detections = object_detections.apply(lambda x: _assign_single_class(x, tissue_map, class_), axis=1) 82 | return object_detections 83 | 84 | 85 | def _assign_single_class(row, tissue_map, tissue_type_name): 86 | tissue_map_val = tissue_map[row.CentroidY, row.CentroidX] 87 | if tissue_map_val != 0: 88 | row['Parent'] = tissue_type_name 89 | return row 90 | 91 | def process_slide(slide_id): 92 | tissue_maps, img_size = load_tissue_maps(os.path.join(region_detection_dir, slide_id.split('.')[0]), classes) 93 | output_fn = os.path.join(output_dir, slide_id + '.csv') 94 | if os.path.exists(output_fn): 95 | print("{} exists; skipping".format(output_fn)) 96 | return 97 | 98 | if tissue_maps is None: 99 | print('{} has no tissue_maps; skipping'.format(slide_id)) 100 | return 101 | 102 | # map detections into bitmap coordinates and assign each cell a parent tissue type 103 | cell_detections = load_cell_detections(os.path.join(object_detection_dir, slide_id + '.tsv'), object_coords_to_region_coords_scale_factor, img_size) 104 | cell_detections = assign_cell_parents(cell_detections, tissue_maps) 105 | #print(cell_detections.Parent.value_counts()) 106 | print('processed {}'.format(slide_id)) 107 | visualize_cell_detections(cell_detections, img_size, os.path.join(VIZ_DIR, slide_id + '.png')) 108 | cell_detections = cell_detections[cell_detections.Parent != 'Unknown'] 109 | cell_detections.to_csv(output_fn, index=False) 110 | 111 | 112 | checkpoint_id = config.args.checkpoint_path.split('/')[-1].replace('.torch', '') 113 | object_detection_dir = 'qupath/data/results' 114 | region_detection_dir = 'bitmaps/{}'.format(checkpoint_id) 115 | 116 | slide_ids = [x[:-4] for x in os.listdir(object_detection_dir) if '.tsv' in x] 117 | classes = ['Tumor', 'Stroma', 'Fat', 'Necrosis'] 118 | 119 | # bitmaps are generated at 4/128 = 1/32 resolution in pixel coordinates 120 | # cells are detected in µm coordinates at full resolution.
1 pixel = 0.5 µm 121 | object_coords_to_region_coords_scale_factor = 16.096 # ~32 full-res px per bitmap px x ~0.5 µm/px, i.e. ~16 µm per bitmap pixel 122 | 123 | output_dir = 'final_objects/{}'.format(checkpoint_id) 124 | if not os.path.exists(output_dir): 125 | os.mkdir(output_dir) 126 | 127 | VIZ_DIR = 'visualizations/{}'.format(checkpoint_id) 128 | if not os.path.exists(VIZ_DIR): 129 | os.mkdir(VIZ_DIR) 130 | 131 | 132 | if __name__ == '__main__': 133 | Parallel(n_jobs=64)(delayed(process_slide)(slide_id) for slide_id in slide_ids) 134 | #for slide_id in slide_ids: 135 | # process_slide(slide_id) 136 | -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/Makefile: -------------------------------------------------------------------------------- 1 | export MYUID := $(shell id -u) 2 | export MYGID := $(shell id -g) 3 | export SINGULARITY_CACHEDIR := $(PWD)/.singularity/cache 4 | export SINGULARITY_TMPDIR := $(PWD)/.singularity/tmp 5 | export SINGULARITY_LOCALCACHEDIR := $(PWD)/.singularity/lcache 6 | IMAGE = docker://druvpatel/qupath-stardist:latest 7 | 8 | .PHONY: all help build clean test run 9 | .DEFAULT_GOAL := help 10 | 11 | # help menu 12 | define BROWSER_PYSCRIPT 13 | import os, webbrowser, sys 14 | 15 | try: 16 | from urllib import pathname2url 17 | except: 18 | from urllib.request import pathname2url 19 | 20 | webbrowser.open("file://" + pathname2url(os.path.abspath(sys.argv[1]))) 21 | endef 22 | export BROWSER_PYSCRIPT 23 | 24 | define PRINT_HELP_PYSCRIPT 25 | import re, sys 26 | 27 | for line in sys.stdin: 28 | match = re.match(r'^([a-zA-Z_-]+):.*?## (.*)$$', line) 29 | if match: 30 | target, help = match.groups() 31 | print("%-20s %s" % (target, help)) 32 | endef 33 | export PRINT_HELP_PYSCRIPT 34 | 35 | 36 | BROWSER := python -c "$$BROWSER_PYSCRIPT" 37 | 38 | help: 39 | @python -c "$$PRINT_HELP_PYSCRIPT" < $(MAKEFILE_LIST) 40 | 41 | 42 | build: ## build docker image 43 | docker build -t qupath/latest . 44 | 45 | 46 | init: 47 | ./init_singularity_env.sh 48 | 49 | 50 | clean: ## cleanup 51 | docker system prune 52 | 53 | 54 | clean-images: ## remove qupath-stardist docker and singularity images. WARNING: Will affect other users! 55 | docker rmi -f $(IMAGE) 56 | singularity cache clean -f 57 | rm qupath-stardist_latest.sif 58 | 59 | 60 | run-cpu: ## run a script within the star_dist service. This run uses cpus. 61 | docker-compose -f docker-compose.yml run \ 62 | star_dist \ 63 | java -Djava.awt.headless=true -Djava.library.path=/qupath-cpu/build/dist/QuPath-0.2.3/lib/app \ 64 | -jar /qupath-cpu/build/dist/QuPath-0.2.3/lib/app/qupath-0.2.3.jar \ 65 | script --image /$(image) /$(script) 66 | 67 | run-gpu: ## run a script within the star_dist service. This run uses gpus.
68 | docker-compose -f docker-compose.yml run \ 69 | -e TF_FORCE_GPU_ALLOW_GROWTH=true -e PER_PROCESS_GPU_MEMORY_FRACTION=0.8 \ 70 | star_dist \ 71 | java -Djava.awt.headless=true -Djava.library.path=/qupath-gpu/build/dist/QuPath-0.2.3/lib/app \ 72 | -jar /qupath-gpu/build/dist/QuPath-0.2.3/lib/app/qupath-0.2.3.jar \ 73 | script --image /$(image) /$(script) 74 | 75 | 76 | build-singularity: init ## pulls qupath-stardist image and converts it to a singularity image 77 | singularity pull --force $(IMAGE) 78 | singularity cache list -v 79 | 80 | 81 | run-singularity-gpu: ## runs the qupath-stardist image in gpu mode 82 | singularity run --env TF_FORCE_GPU_ALLOW_GROWTH=true,PER_PROCESS_GPU_MEMORY_FRACTION=0.8 -B $(PWD)/data:/data,$(PWD)/detections:/detections,$(PWD)/models:/models,$(PWD)/scripts:/scripts --nv $(PWD)/qupath-stardist_latest.sif java -Djava.awt.headless=true \ 83 | -Djava.library.path=/qupath-gpu/build/dist/QuPath-0.2.3/lib/app \ 84 | -jar /qupath-gpu/build/dist/QuPath-0.2.3/lib/app/qupath-0.2.3.jar \ 85 | script --image /$(image) /$(script) 86 | 87 | run-singularity-cpu: ## runs the qupath-stardist image in cpu mode 88 | singularity run -B $(PWD)/data:/data,$(PWD)/detections:/detections,$(PWD)/models:/models,$(PWD)/scripts:/scripts $(PWD)/qupath-stardist_latest.sif java -Djava.awt.headless=true \ 89 | -Djava.library.path=/qupath-cpu/build/dist/QuPath-0.2.3/lib/app \ 90 | -jar /qupath-cpu/build/dist/QuPath-0.2.3/lib/app/qupath-0.2.3.jar \ 91 | script --image /$(image) /$(script) 92 | 93 | 94 | 95 | -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/README.md: -------------------------------------------------------------------------------- 1 | # QuPath bundled with StarDist and Tensorflow 2 | 3 | Sample data in data/sample_data has been downloaded from http://openslide.cs.cmu.edu/download/openslide-testdata/. 4 | More test data from various vendors can be found on this site. 5 | 6 | This repo contains a containerized version of QuPath+StarDist with GPU support. These containers can be built and run using Docker (Section 2) or Singularity (Section 1). 7 | 8 | # Overview 9 | 10 | ## What is QuPath? 11 | QuPath is a digital image analysis platform that is particularly useful for analyzing pathology images. QuPath runs Groovy-based scripts, which can be executed through the UI or, as in this case, headless through a docker container. 12 | 13 | ## What is StarDist? 14 | StarDist is a nuclear segmentation algorithm capable of detecting and segmenting cells/nuclei in pathology images. It runs on a tensorflow backend and has prebuilt models available to perform cellular segmentation in H&E and IF images. 15 | 16 | When running StarDist in QuPath, nuclear/cellular objects are created along with a dictionary of per-cell features such as staining properties (hematoxylin and eosin staining metrics for H&E) and geometric properties (size, shape, length, etc.). 17 | 18 | ## How to write and run your own scripts 19 | Some example scripts have been provided to demonstrate some of the functionalities of the QuPath groovy scripting interface. 20 | 21 | `stardist_example.groovy` --> This script will run the StarDist cellular segmentation algorithm based on the given parameters in the file. This will result in cellular objects being created, as well as a dictionary of per-cell features. This script will also show how these cell objects can be exported into two different formats -- geojson and tsv.
Exporting in geojson will export each cell's vertices outlining the cell segmentation, but this also means the file can be quite large. On the other hand, TSV does not retain the polygon cellular outlines, only the coordinates of the centroid, which is much more compact. 22 | 23 | `detection_first_run_hne_stardist_segmentation.groovy` --> This is a more advanced script that combines multiple aspects of QuPath. It runs StarDist segmentations, as well as a cellular classifier which is able to classify these cellular objects into various classes (in this case lymphocyte vs other cell phenotypes). In addition, this script also performs whole-slide pixel classification using a basic model. The unique part about this script is that upon export, the cellular objects will contain a class (lymphocyte vs other) as well as a parent class (the regional annotation label that the cell object is in, based on the results of the pixel classifier). 24 | 25 | # Section 1: Building and Running Image with Singularity 26 | This image has been prebuilt on Dockerhub (as a docker image) to run via singularity. 27 | 28 | Git clone this repo. 29 | 30 | ``` 31 | $ git clone https://github.com/msk-mind/docker.git 32 | ``` 33 | 34 | Change directory to this dir and initialize the local singularity environment. `init_singularity_env.sh` creates a localized singularity env in the current directory by creating a `.singularity` sub-directory. This script will need to be re-executed each time you start a new shell and want to run singularity commands against the localized environment directly from the shell, as opposed to using the targets in the makefile. 35 | 36 | ``` 37 | $ cd qupath 38 | $ ./init_singularity_env.sh 39 | ``` 40 | 41 | Build the singularity image. 42 | 43 | ``` 44 | $ make build-singularity 45 | ``` 46 | 47 | Run the singularity image by specifying the script and image arguments. Like the docker image, the command for executing the container has been designed to use the 'data', 'scripts', 'detections', and 'models' directories to map these files to the container file system. These directories and files must be specified as relative paths. Any data that needs to be referenced outside of detections/, data/, scripts/, and models/ should be mounted using the -B flag. To do this, append the new mount (comma separated) to the -B argument in the makefile under run-singularity-cpu and/or run-singularity-gpu as follows: /path/on/host:/bind/path/on/container. 48 | 49 | If successful, `stardist_example.groovy` will output the coordinates of cell object centroids to `detections/CMU-1-Small-Region_2_stardist_detections_and_measurements.tsv` 50 | 51 | To run with CPUs: use `run-singularity-cpu`, and for GPUs use `run-singularity-gpu`. 52 | 53 | The first time the `run-singularity-gpu` make target is executed, it tends to take about 4 minutes. Subsequent runs tend to take 20-30 sec. 54 | 55 | Note: adding hosts is not currently supported with the singularity build. 56 | 57 | ``` 58 | $ time make \ 59 | script=scripts/sample_scripts/stardist_example.groovy \ 60 | image=data/sample_data/CMU-1-Small-Region_2.svs run-singularity-gpu 61 | ``` 62 | 63 | To restart with a clean slate, simply delete the `.singularity` sub-directory and re-initialize the local singularity environment by executing `init_singularity_env.sh`.
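For CPU-only hosts, the `run-singularity-cpu` target takes the same `script` and `image` arguments. A minimal sketch of the analogous CPU invocation (assuming the image has been built and the sample slide is in place, as above):

```
$ time make \
    script=scripts/sample_scripts/stardist_example.groovy \
    image=data/sample_data/CMU-1-Small-Region_2.svs run-singularity-cpu
```

The bind mounts and QuPath jar invocation come from the `run-singularity-cpu` target in the Makefile; only the CPU vs. GPU QuPath build path and the `--nv`/TensorFlow GPU environment flags differ between the two targets.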
64 | 65 | 66 | # Section 2: Building and Running Image with Docker (WIP) 67 | ## Section 2: Part 1 -- Build image using Dockerfile 68 | 69 | For building with Docker, there is a small setup step that needs to be done. Using the following links, download these .deb files and copy them to this directory `docker/qupath/`. You will have to create an NVIDIA developer account in order to do this. 70 | 71 | 1) libcudnn7_7.6.5.32-1%2Bcuda10.2_amd64.deb: 72 | https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7_7.6.5.32-1%2Bcuda10.2_amd64.deb 73 | 74 | 2) libcudnn7-dev_7.6.5.32-1%2Bcuda10.2_amd64.deb 75 | https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7-dev_7.6.5.32-1%2Bcuda10.2_amd64.deb 76 | 77 | 3) libcudnn7-doc_7.6.5.32-1%2Bcuda10.2_amd64.deb 78 | https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7-doc_7.6.5.32-1%2Bcuda10.2_amd64.deb 79 | 80 | 81 | Once you've downloaded these files, you can proceed with the build: 82 | ``` 83 | $ make build 84 | $ docker images 85 | 86 | REPOSITORY TAG IMAGE ID CREATED SIZE 87 | qupath/latest latest ead3bb08477d About a minute ago 2.4GB 88 | adoptopenjdk/openjdk14 x86_64-debian-jdk-14.0.2_12 9350dbb3ad77 4 days ago 516MB 89 | ``` 90 | 91 | ## Section 2: Part 2 -- Run QuPath groovy script using built Docker container 92 | 93 | 94 | The command for executing the container has been designed to use the 'data', 'detections', 'scripts' and 'models' directories to map these files to the container file system. These directories and files must be specified as relative paths. 95 | 96 | If the script uses an external API, the URL and IP of the API must be provided to the container using the host argument. The host IP can be obtained from the URL using the nslookup Linux command. Two or more hosts may be specified by specifying multiple host arguments. If the script does not use any external API, the host argument must be specified at least once with an empty string since it is a required argument. 97 | 98 | To run with CPUs, use `run-cpu`; to use GPUs, use `run-gpu`. 99 | 100 | Examples: 101 | 102 | This script can be used to import annotations from the getPathologyAnnotations API: 103 | ``` 104 | make host="--add-host=" \ 105 | script=scripts/sample_scripts/import_annot_from_api.groovy \ 106 | image=data/sample_data/HobI20-934829783117.svs run-cpu 107 | ``` 108 | 109 | If successful, `stardist_example.groovy` will output a geojson of cell objects to data/test.geojson. 110 | ``` 111 | make host="" \ 112 | script=scripts/sample_scripts/hne/stardist_example.groovy \ 113 | image=data/sample_data/CMU-1-Small-Region_2.svs run-gpu 114 | ``` 115 | 116 | 117 | 118 | ## Section 2: Part 3 -- Cleanup Docker container 119 | Cleans stopped/exited containers, unused networks, dangling images, dangling build caches 120 | 121 | ``` 122 | $ make clean 123 | ``` 124 | 125 | ## WIP/TODOs 126 | - Infrastructure: currently uses a single GPU but allocates all; a future job scheduler with a GPU allocator would be great to fully utilize available GPUs.
(To come with Condor) 127 | - Get GPU working for the docker image (in the event we need to use docker instead of singularity) 128 | 129 | 130 | ## Logs 131 | - started with adoptopenjdk:openjdk14 132 | 133 | - ImageWriterIJTest > testTiffAndZip() FAILED 134 | java.awt.HeadlessException at ImageWriterIJTest.java:46 135 | 4 tests completed, 1 failed 136 | 137 | so, excluded tests with `gradle ... -x test` (see dockerfile) 138 | 139 | - Error: java.io.IOException: Cannot run program "objcopy": error=2, No such file or directory 140 | 141 | so, installed binutils (see dockerfile) 142 | 143 | - 17:18:31.139 [main] [ERROR] q.l.i.s.o.OpenslideServerBuilder - Could not load OpenSlide native libraries 144 | java.lang.UnsatisfiedLinkError: /qupath/build/dist/QuPath-0.2.3/lib/app/libopenslide-jni.so: libxml2.so.2: cannot open shared object file: No such file or directory 145 | 146 | so, installed native libs (see dockerfile) 147 | 148 | - cleanup stopped/exited containers, networks, dangling images, dangling build cache 149 | docker system prune 150 | 151 | 152 | 153 | 154 | 155 | 156 | -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/data/results/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/qupath/data/results/.gitkeep -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/data/slides/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/qupath/data/slides/.gitkeep -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/docker-compose.yml: -------------------------------------------------------------------------------- 1 | # docker-compose.yml 2 | # Used to ensure container is executed using host user id and gid (so files written by container are owned by user), and docker image remains user agnostic. 3 | # https://medium.com/faun/set-current-host-user-for-docker-container-4e521cef9ffc 4 | version: '3' 5 | services: 6 | star_dist: 7 | image: docker.io/druvpatel/qupath-stardist 8 | user: $MYUID:$MYGID 9 | working_dir: $PWD 10 | stdin_open: true 11 | volumes: 12 | - $PWD/data:/data 13 | - $PWD/models:/models 14 | - $PWD/scripts:/scripts 15 | - $PWD/detections:/detections 16 | tty: true 17 | -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/dockerfile: -------------------------------------------------------------------------------- 1 | 2 | # https://hub.docker.com/r/adoptopenjdk/openjdk14 3 | FROM adoptopenjdk/openjdk14:x86_64-debian-jdk-14.0.2_12 4 | MAINTAINER MSK-MIND 5 | LABEL "app"="qupath" 6 | LABEL "version"="0.2.3" 7 | LABEL "description"="qupath bundled with stardist and tensorflow for CPUs.
Change -Ptensorflow-cpu=false to switch to GPUs" 8 | RUN apt-get update && \ 9 | apt-get install -y sudo tree wget git vim && \ 10 | apt-get install -y gnupg2 && \ 11 | apt-get install -y binutils && \ 12 | apt-get install -y libxml2-dev libtiff-dev libglib2.0-0 libxcb-shm0-dev libxrender-dev libxcb-render0-dev && \ 13 | apt-get install -y libgl1-mesa-glx && \ 14 | apt-get install -y software-properties-common 15 | 16 | #install cuda 10.2 17 | RUN apt-get update && wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-ubuntu1604.pin && \ 18 | sudo mv cuda-ubuntu1604.pin /etc/apt/preferences.d/cuda-repository-pin-600 && \ 19 | sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub && \ 20 | sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/ /" && \ 21 | sudo apt-get update && \ 22 | sudo apt-get -y install cuda-libraries-10-2 23 | 24 | RUN echo 'export PATH=/usr/local/cuda-10.2/bin:$PATH' >> ~/.bashrc && \ 25 | echo 'export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc && \ 26 | . ~/.bashrc 27 | 28 | 29 | # install cudnn7 for cuda 10.2 30 | ## these can be found on nvidia's website after logging in through a developer account 31 | ## to build this docker image manually, download these files and copy them to the /qupath directory (the same directory as this dockerfile) 32 | ## https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7_7.6.5.32-1%2Bcuda10.2_amd64.deb 33 | ## https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7-dev_7.6.5.32-1%2Bcuda10.2_amd64.deb 34 | ## https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7-doc_7.6.5.32-1%2Bcuda10.2_amd64.deb 35 | COPY libcudnn7_7.6.5.32-1+cuda10.2_amd64.deb . 36 | COPY libcudnn7-dev_7.6.5.32-1+cuda10.2_amd64.deb . 37 | COPY libcudnn7-doc_7.6.5.32-1+cuda10.2_amd64.deb . 
38 | RUN apt-get install -y dpkg-dev && apt install ./libcudnn7_7.6.5.32-1+cuda10.2_amd64.deb && \ 39 | apt install ./libcudnn7-dev_7.6.5.32-1+cuda10.2_amd64.deb && \ 40 | apt install ./libcudnn7-doc_7.6.5.32-1+cuda10.2_amd64.deb 41 | 42 | 43 | RUN /bin/sh -c export DISPLAY=:0.0 44 | RUN /bin/sh -c cd / && mkdir qupath-gpu && \ 45 | git clone --branch v0.2.3 https://github.com/qupath/qupath.git /qupath-gpu && \ 46 | cd /qupath-gpu && \ 47 | ./gradlew clean build createPackage -x test -Ptensorflow-gpu=true 48 | 49 | RUN /bin/sh -c cd / && mkdir qupath-cpu && \ 50 | git clone --branch v0.2.3 https://github.com/qupath/qupath.git /qupath-cpu && \ 51 | cd /qupath-cpu && \ 52 | ./gradlew clean build createPackage -x test -Ptensorflow-cpu=true 53 | 54 | RUN apt-get update && apt-get install -y libnccl-dev 55 | -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/init_singularity_env.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | mkdir -p .singularity/cache 4 | mkdir -p .singularity/tmp 5 | mkdir -p .singularity/lcache 6 | 7 | export SINGULARITY_CACHEDIR=$PWD/.singularity/cache 8 | export SINGULARITY_TMPDIR=$PWD/.singularity/tmp 9 | export SINGULARITY_LOCALCACHEDIR=$PWD/.singularity/lcache 10 | 11 | echo using SINGULARITY_CACHEDIR = $SINGULARITY_CACHEDIR 12 | echo using SINGULARITY_TMPDIR = $SINGULARITY_TMPDIR 13 | echo using SINGULARITY_LOCALCACHEDIR= $SINGULARITY_LOCALCACHEDIR 14 | -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/models/ANN_StardistSeg3.0CellExp1.0CellConstraint_AllFeatures_LymphClassifier.json: -------------------------------------------------------------------------------- 1 | { 2 | "object_classifier_type": "OpenCVMLClassifier", 3 | "featureExtractor": { 4 | "feature_extractor_type": "NormalizedFeatureExtractor", 5 | "featureExtractor": { 6 | "feature_extractor_type": "DefaultFeatureExtractor", 7 | "measurements": [ 8 | "Detection probability", 9 | "Nucleus: Area µm^2", 10 | "Nucleus: Length µm", 11 | "Nucleus: Circularity", 12 | "Nucleus: Solidity", 13 | "Nucleus: Max diameter µm", 14 | "Nucleus: Min diameter µm", 15 | "Cell: Area µm^2", 16 | "Cell: Length µm", 17 | "Cell: Circularity", 18 | "Cell: Solidity", 19 | "Cell: Max diameter µm", 20 | "Cell: Min diameter µm", 21 | "Nucleus/Cell area ratio", 22 | "Hematoxylin: Nucleus: Mean", 23 | "Hematoxylin: Nucleus: Median", 24 | "Hematoxylin: Nucleus: Min", 25 | "Hematoxylin: Nucleus: Max", 26 | "Hematoxylin: Nucleus: Std.Dev.", 27 | "Hematoxylin: Cytoplasm: Mean", 28 | "Hematoxylin: Cytoplasm: Median", 29 | "Hematoxylin: Cytoplasm: Min", 30 | "Hematoxylin: Cytoplasm: Max", 31 | "Hematoxylin: Cytoplasm: Std.Dev.", 32 | "Hematoxylin: Membrane: Mean", 33 | "Hematoxylin: Membrane: Median", 34 | "Hematoxylin: Membrane: Min", 35 | "Hematoxylin: Membrane: Max", 36 | "Hematoxylin: Membrane: Std.Dev.", 37 | "Hematoxylin: Cell: Mean", 38 | "Hematoxylin: Cell: Median", 39 | "Hematoxylin: Cell: Min", 40 | "Hematoxylin: Cell: Max", 41 | "Hematoxylin: Cell: Std.Dev.", 42 | "Eosin: Nucleus: Mean", 43 | "Eosin: Nucleus: Median", 44 | "Eosin: Nucleus: Min", 45 | "Eosin: Nucleus: Max", 46 | "Eosin: Nucleus: Std.Dev.", 47 | "Eosin: Cytoplasm: Mean", 48 | "Eosin: Cytoplasm: Median", 49 | "Eosin: Cytoplasm: Min", 50 | "Eosin: Cytoplasm: Max", 51 | "Eosin: Cytoplasm: Std.Dev.", 52 | "Eosin: Membrane: Mean", 53 | "Eosin: Membrane: Median", 54 | "Eosin: Membrane: Min", 
55 | "Eosin: Membrane: Max", 56 | "Eosin: Membrane: Std.Dev.", 57 | "Eosin: Cell: Mean", 58 | "Eosin: Cell: Median", 59 | "Eosin: Cell: Min", 60 | "Eosin: Cell: Max", 61 | "Eosin: Cell: Std.Dev." 62 | ] 63 | }, 64 | "normalizer": { 65 | "offsets": [ 66 | 0.0, 67 | 0.0, 68 | 0.0, 69 | 0.0, 70 | 0.0, 71 | 0.0, 72 | 0.0, 73 | 0.0, 74 | 0.0, 75 | 0.0, 76 | 0.0, 77 | 0.0, 78 | 0.0, 79 | 0.0, 80 | 0.0, 81 | 0.0, 82 | 0.0, 83 | 0.0, 84 | 0.0, 85 | 0.0, 86 | 0.0, 87 | 0.0, 88 | 0.0, 89 | 0.0, 90 | 0.0, 91 | 0.0, 92 | 0.0, 93 | 0.0, 94 | 0.0, 95 | 0.0, 96 | 0.0, 97 | 0.0, 98 | 0.0, 99 | 0.0, 100 | 0.0, 101 | 0.0, 102 | 0.0, 103 | 0.0, 104 | 0.0, 105 | 0.0, 106 | 0.0, 107 | 0.0, 108 | 0.0, 109 | 0.0, 110 | 0.0, 111 | 0.0, 112 | 0.0, 113 | 0.0, 114 | 0.0, 115 | 0.0, 116 | 0.0, 117 | 0.0, 118 | 0.0, 119 | 0.0 120 | ], 121 | "scales": [ 122 | 1.0, 123 | 1.0, 124 | 1.0, 125 | 1.0, 126 | 1.0, 127 | 1.0, 128 | 1.0, 129 | 1.0, 130 | 1.0, 131 | 1.0, 132 | 1.0, 133 | 1.0, 134 | 1.0, 135 | 1.0, 136 | 1.0, 137 | 1.0, 138 | 1.0, 139 | 1.0, 140 | 1.0, 141 | 1.0, 142 | 1.0, 143 | 1.0, 144 | 1.0, 145 | 1.0, 146 | 1.0, 147 | 1.0, 148 | 1.0, 149 | 1.0, 150 | 1.0, 151 | 1.0, 152 | 1.0, 153 | 1.0, 154 | 1.0, 155 | 1.0, 156 | 1.0, 157 | 1.0, 158 | 1.0, 159 | 1.0, 160 | 1.0, 161 | 1.0, 162 | 1.0, 163 | 1.0, 164 | 1.0, 165 | 1.0, 166 | 1.0, 167 | 1.0, 168 | 1.0, 169 | 1.0, 170 | 1.0, 171 | 1.0, 172 | 1.0, 173 | 1.0, 174 | 1.0, 175 | 1.0 176 | ], 177 | "missingValue": 0.0 178 | } 179 | }, 180 | "classifier": { 181 | "class": "ANN_MLP", 182 | "statmodel": { 183 | "format": 3, 184 | "layer_sizes": [ 185 | 54, 186 | 2 187 | ], 188 | "activation_function": "SIGMOID_SYM", 189 | "f_param1": 1.0, 190 | "f_param2": 1.0, 191 | "min_val": -9.4999999999999996e-01, 192 | "max_val": 9.4999999999999996e-01, 193 | "min_val1": -9.7999999999999998e-01, 194 | "max_val1": 9.7999999999999998e-01, 195 | "training_params": { 196 | "train_method": "RPROP", 197 | "dw0": 1.0000000000000001e-01, 198 | "dw_plus": 1.2000000000000000e+00, 199 | "dw_minus": 5.0000000000000000e-01, 200 | "dw_min": 1.1920928955078125e-07, 201 | "dw_max": 50.0, 202 | "term_criteria": { 203 | "epsilon": 1.0000000000000000e-02, 204 | "iterations": 1000 205 | } 206 | }, 207 | "input_scale": [ 208 | 1.3884230478630293e+01, 209 | -1.0436028848950381e+01, 210 | 4.0651217206917785e-02, 211 | -1.5288830675717675e+00, 212 | 1.4902576440765389e-01, 213 | -3.2632152740591982e+00, 214 | 1.8651759142111512e+01, 215 | -1.6932751758711024e+01, 216 | 7.9975348973840283e+01, 217 | -7.9813989744728346e+01, 218 | 3.7380041101211908e-01, 219 | -2.9905965766939682e+00, 220 | 6.1167716922545190e-01, 221 | -3.5192343231666916e+00, 222 | 2.3319211712556121e-02, 223 | -2.2874284833089127e+00, 224 | 1.3240275045470298e-01, 225 | -4.8788867016263282e+00, 226 | 1.5880131397602792e+01, 227 | -1.3770291753516185e+01, 228 | 5.1799179370354359e+01, 229 | -5.0826996462122764e+01, 230 | 3.5865866097388122e-01, 231 | -4.7469990799870470e+00, 232 | 4.6927276840148247e-01, 233 | -4.5527534416253266e+00, 234 | 1.0147320988075766e+01, 235 | -3.7268640560457684e+00, 236 | 3.7563079961938541e+00, 237 | -2.8222394651682250e+00, 238 | 3.4207509789013120e+00, 239 | -2.6126520844583649e+00, 240 | 7.2903065872421591e+00, 241 | -1.1929507138294884e+00, 242 | 2.0236865315993695e+00, 243 | -2.8701281475600342e+00, 244 | 8.9732545773844841e+00, 245 | -2.2544507476912941e+00, 246 | 9.2196282204976701e+00, 247 | -2.0842894032683126e+00, 248 | 8.4876590975436681e+00, 249 | -1.5404532839449356e+00, 250 | 
2.1340521185877950e+01, 251 | 6.4541155503805547e-01, 252 | 3.0137932665358207e+00, 253 | -2.9686044300429262e+00, 254 | 1.5334500805912315e+01, 255 | -2.8641689072376773e+00, 256 | 8.6606817415675295e+00, 257 | -1.7829889272624488e+00, 258 | 8.5503568388131157e+00, 259 | -1.4597032278579616e+00, 260 | 1.9717646673081735e+01, 261 | 3.0540728997987365e-01, 262 | 3.1075686746922222e+00, 263 | -2.1133265088121331e+00, 264 | 1.2863514501373606e+01, 265 | -2.0644588022774628e+00, 266 | 7.9272915122578809e+00, 267 | -3.2034772169770411e+00, 268 | 6.1492929262431808e+00, 269 | -1.8573746533587634e+00, 270 | 2.1697724885291631e+01, 271 | 6.7287375171150243e-01, 272 | 2.0260795259491804e+00, 273 | -2.8956706968133008e+00, 274 | 7.4670422801461367e+00, 275 | -2.4551901654398010e+00, 276 | 1.7603024815704330e+01, 277 | -2.1020666846556337e+00, 278 | 1.7723557439925614e+01, 279 | -2.0074159355618573e+00, 280 | 4.8008125885821906e+00, 281 | 9.8948794236022031e-01, 282 | 3.0647364222699682e+00, 283 | -1.4691812331912364e+00, 284 | 1.2562629775839415e+01, 285 | -1.5394854943914542e+00, 286 | 1.8172876240023772e+01, 287 | -1.8730041731417946e+00, 288 | 1.7708167986592180e+01, 289 | -1.8311933989238971e+00, 290 | 8.8517673621658144e+00, 291 | 8.2400096007605295e-01, 292 | 7.7696295729603495e+00, 293 | -2.0826594164420631e+00, 294 | 4.8397922833946005e+01, 295 | -2.6474752710474303e+00, 296 | 1.7803035271910812e+01, 297 | -1.8877641496042963e+00, 298 | 1.7244050875851336e+01, 299 | -1.8057484529102026e+00, 300 | 1.3645091260729304e+01, 301 | 2.5527867627681750e-01, 302 | 9.2220882392120895e+00, 303 | -2.1773014024968322e+00, 304 | 4.4806689473518418e+01, 305 | -2.3295121210659278e+00, 306 | 1.8800432115096598e+01, 307 | -2.0465181349344106e+00, 308 | 1.8251381185567340e+01, 309 | -1.9319916323956488e+00, 310 | 4.7829401642175453e+00, 311 | 1.0379696088234613e+00, 312 | 3.0625094741185959e+00, 313 | -1.4977509313276964e+00, 314 | 2.2666122841938190e+01, 315 | -1.9651871233359419e+00 316 | ], 317 | "output_scale": [ 318 | 1.0, 319 | 0.0, 320 | 1.0, 321 | 0.0 322 | ], 323 | "inv_output_scale": [ 324 | 1.0, 325 | 0.0, 326 | 1.0, 327 | 0.0 328 | ], 329 | "weights": [ 330 | [ 331 | 5.3131502607590453e-01, 332 | 1.3319032953543575e-01, 333 | 7.5677286629269147e-02, 334 | 3.1806455147243787e-03, 335 | -7.5296388482771714e-01, 336 | 1.7410383250467160e-01, 337 | -6.5718326911217329e-01, 338 | 5.6077796882772435e-01, 339 | 1.4355113845598161e-01, 340 | -2.7589820288151073e-01, 341 | -9.3726938824941830e-01, 342 | 1.1273909427692799e+00, 343 | 3.3792224510181301e-01, 344 | 1.2580719918406363e-01, 345 | -7.4447272743560278e-01, 346 | 5.0350170458677901e-02, 347 | -8.1057370658860572e-01, 348 | 6.4570399303329262e-01, 349 | -2.8815758124418311e-01, 350 | 1.0613828078319948e-01, 351 | 3.7224267919660070e-01, 352 | -9.2478641209280732e-02, 353 | -3.2678283297318611e-01, 354 | 1.3640023219146697e+00, 355 | -1.0364842630454227e+00, 356 | 1.6912933555400111e+00, 357 | -5.3676523232830187e-01, 358 | 7.4875001179266654e-01, 359 | 1.1938863816812013e+00, 360 | -2.1728246328279646e+00, 361 | 1.9005019638922551e+00, 362 | -3.6124597246245034e+00, 363 | -1.6750112022431682e-01, 364 | -1.9736721466993229e-01, 365 | -5.3750827455159578e-03, 366 | 1.0342167736052412e+00, 367 | -1.3496084665863162e-01, 368 | -1.3093091792621530e+00, 369 | -1.5330208939125556e-01, 370 | -7.7381676828638113e-03, 371 | 5.1686721932381874e-01, 372 | 6.3168367752957066e-01, 373 | -7.9753752002407829e-01, 374 | 6.6204874496166144e-01, 375 | 
-2.6004031347000062e-01, 376 | -4.2408458497567647e-01, 377 | -5.4686580456146239e-01, 378 | 4.2267154081098107e-01, 379 | 6.2117431785880362e-01, 380 | 7.0263537699839507e-01, 381 | -1.1724467699660965e+00, 382 | -3.7546113575704887e-01, 383 | -9.4388902234390426e-01, 384 | -6.1731725412889404e-01, 385 | -7.3112030637524750e-01, 386 | 2.4387987377295198e-01, 387 | 9.6708275345891925e-01, 388 | 1.5507609635086583e-01, 389 | -5.2843267148099748e-01, 390 | -1.2043873253617352e-01, 391 | -6.0703943680523076e-01, 392 | 4.1308862491398889e-02, 393 | 2.2855201109314574e-01, 394 | 1.1902682401126872e+00, 395 | -1.4586465862716863e-01, 396 | -5.8694397098993642e-02, 397 | 2.0510988973135409e+00, 398 | -4.4058807822921127e+00, 399 | 5.3717924422466412e-01, 400 | 1.0892272856150961e-01, 401 | -4.5976354260767632e-01, 402 | -4.5710278828830914e-02, 403 | -6.7111726958554385e-02, 404 | -1.1266965359361265e-01, 405 | 1.3495583249953398e+00, 406 | -1.0543581371422708e+00, 407 | 6.7783428476582031e-01, 408 | -7.5820186675059664e-01, 409 | 6.8271281498259717e-01, 410 | -5.8791671161933907e-02, 411 | -9.3443263461570902e-01, 412 | -9.4366261563277237e-02, 413 | -6.5883480364552061e-01, 414 | -6.4001380261849916e-01, 415 | 1.0451022662285130e-01, 416 | 4.1599372517711841e-01, 417 | -5.1432979559771852e-01, 418 | 1.5904643675018923e-01, 419 | -8.4432473820839860e-01, 420 | 4.1337428950857474e-01, 421 | 9.8918057798433190e-01, 422 | 9.2360472791201131e-01, 423 | 7.1413128861014574e-01, 424 | 2.5961010759479131e-01, 425 | -7.9186179150659919e-01, 426 | 3.1202167349121934e-01, 427 | -3.3639543745691192e-01, 428 | 4.5762746282061900e-01, 429 | 9.5306022142570823e-01, 430 | -2.5812084340957475e-01, 431 | -3.7019001215589353e-01, 432 | -4.3609373559489228e-01, 433 | 5.9580211866483324e-01, 434 | -7.7101094798609293e-01, 435 | 3.8444305971371939e-01, 436 | -1.8652039138823739e+00, 437 | 6.6781601499775434e-01, 438 | -2.6964085469021148e-01, 439 | -1.7095248939777175e+00, 440 | 1.9560948934635687e+00 441 | ] 442 | ] 443 | } 444 | }, 445 | "pathClasses": [ 446 | { 447 | "name": "Lymphocyte", 448 | "colorRGB": -14336 449 | }, 450 | { 451 | "name": "Other", 452 | "colorRGB": -16744320 453 | } 454 | ], 455 | "requestProbabilityEstimate": false, 456 | "filter": "DETECTIONS_ALL", 457 | "timestamp": 1612801125986 458 | } -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/models/he_heavy_augment/saved_model.pb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/qupath/models/he_heavy_augment/saved_model.pb -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/models/he_heavy_augment/variables/variables.data-00000-of-00001: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/qupath/models/he_heavy_augment/variables/variables.data-00000-of-00001 -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/models/he_heavy_augment/variables/variables.index: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/qupath/models/he_heavy_augment/variables/variables.index -------------------------------------------------------------------------------- /hne-feature-extraction/qupath/scripts/stardist_nuclei_and_lymphocytes.groovy: -------------------------------------------------------------------------------- 1 | import qupath.tensorflow.stardist.StarDist2D 2 | import qupath.lib.io.GsonTools 3 | import static qupath.lib.gui.scripting.QPEx.* 4 | setImageType('BRIGHTFIELD_H_E'); 5 | setColorDeconvolutionStains('{"Name" : "H&E default", "Stain 1" : "Hematoxylin", "Values 1" : "0.60968 0.65246 0.4501 ", "Stain 2" : "Eosin", "Values 2" : "0.21306 0.87722 0.43022 ", "Background" : " 243 243 243 "}'); 6 | 7 | def imageData = getCurrentImageData() 8 | def server = imageData.getServer() 9 | 10 | // get dimensions of slide 11 | minX = 0 12 | minY = 0 13 | maxX = server.getWidth() 14 | maxY = server.getHeight() 15 | 16 | print 'maxX' + maxX 17 | print 'maxY' + maxY 18 | 19 | // create rectangle roi (over entire area of image) for detections to be run over 20 | def plane = ImagePlane.getPlane(0, 0) 21 | def roi = ROIs.createRectangleROI(minX, minY, maxX-minX, maxY-minY, plane) 22 | def annotationROI = PathObjects.createAnnotationObject(roi) 23 | addObject(annotationROI) 24 | selectAnnotations(); 25 | def pathModel = '/models/he_heavy_augment' 26 | def cell_expansion_factor = 3.0 27 | def cellConstrainScale = 1.0 28 | def stardist = StarDist2D.builder(pathModel) 29 | .threshold(0.5) // Probability (detection) threshold 30 | .normalizePercentiles(1, 99) // Percentile normalization 31 | .pixelSize(0.5) // Resolution for detection 32 | .cellExpansion(cell_expansion_factor) // Approximate cells based upon nucleus expansion 33 | .cellConstrainScale(cellConstrainScale) // Constrain cell expansion using nucleus size 34 | .measureShape() // Add shape measurements 35 | .measureIntensity() // Add cell measurements (in all compartments) 36 | .includeProbability(true) // Add probability as a measurement (enables later filtering) 37 | .nThreads(10) 38 | .build() 39 | // select rectangle object created 40 | selectObjects { 41 | //Some criteria here 42 | return it == annotationROI 43 | } 44 | def pathObjects = getSelectedObjects() 45 | print 'Selected ' + pathObjects.size() 46 | // stardist segmentations 47 | stardist.detectObjects(imageData, pathObjects) 48 | def celldetections = getDetectionObjects() 49 | print 'Detected' + celldetections.size() 50 | selectDetections(); 51 | // obj classifier 52 | runObjectClassifier("/models/ANN_StardistSeg3.0CellExp1.0CellConstraint_AllFeatures_LymphClassifier.json") 53 | def filename = GeneralTools.getNameWithoutExtension(server.getMetadata().getName()) 54 | saveDetectionMeasurements('/data/results/' + filename + '.tsv') 55 | -------------------------------------------------------------------------------- /hne-feature-extraction/visualizations/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/visualizations/.gitkeep -------------------------------------------------------------------------------- /license.md: -------------------------------------------------------------------------------- 1 | **Onco-Fusion Terms of Use** 2 | 3 | **PLEASE READ THIS DOCUMENT CAREFULLY BEFORE YOU ACCESS OR USE Onco-Fusion. 
BY ACCESSING ANY PORTION OF Onco-Fusion, YOU AGREE TO BE BOUND BY THE TERMS AND CONDITIONS SET FORTH BELOW. IF YOU DO NOT WISH TO BE BOUND BY THESE TERMS AND CONDITIONS, PLEASE DO NOT ACCESS Onco-Fusion.** 4 | 5 | **Onco-Fusion** is developed and maintained by Memorial Sloan Kettering Cancer Center ("MSK," "we", or "us") to **extract features from histopathologic and radiologic imaging and study their contributions to multimodal prognostic models.** MSK may, from time to time, update the software and other content on 6 | **https://github.com/kmboehm/onco-fusion** ("Content"). MSK makes no warranties or representations, and hereby disclaims any warranties, express or implied, with respect to any of the Content, including as to the present accuracy, completeness, timeliness, adequacy, or usefulness of any of the Content. The entire risk as to the quality and performance of the Content is with you. By using this Content, you agree that MSK will not be liable for any losses or damages arising from your use of or reliance on the Content, or other websites or information to which this Content may be linked, including any general, special, incidental or consequential damages arising out of the use or inability to use the Content including, but not limited to, loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of the Content to operate with any other software, programs, source code, etc. 7 | 8 | By making any use of the Content or submitting any information or data along with such to us, you authorize MSK to copy, modify, display, distribute, perform, use, publish, and otherwise exploit the same for any and all purposes, all without compensation to you, for as long as we decide (collectively, the "Use Rights"). In addition, you authorize MSK to grant any third party some or all of the Use Rights. By way of example, and not limitation, the Use Rights include the right for us to publish any data or information submitted in whole or in part for as long as we choose. By providing any data or information, you represent and warrant that (i) you own all rights in and to the information or data (including any related intellectual property rights) or have sufficient authority and right to provide the content and to grant the Use Rights; (ii) your submission of the information or data and grant to us of Use Rights do not violate or conflict with the rights of other persons, or breach your obligations to other persons; and (iii) the information or data does not include or contain any personally identifiable information (PII) or protected health information (PHI). 9 | 10 | DO NOT submit personally identifiable information (PII) or protected health information (PHI) in connection with any information or data or otherwise. 11 | 12 | You may use **Onco-Fusion**, the underlying content, and any output therefrom for personal academic research and noncommercial purposes only, including teaching and research at universities, colleges and other educational institutions. You may not use it for any other purpose. You may not publish the Content in any capacity, including in scientific or academic journals or literature, or the results of such research, without MSK's express written permission. You may not otherwise redistribute or share the Content with any third party, in part or in whole, for any purpose, without the express written permission of MSK.
Any use of the Content for commercial purposes, including but not restricted to consulting activities, design of commercial hardware or software products, and a commercial entity participating in research projects, requires MSK's express written permission and provision of an appropriate license. 13 | 14 | Without limiting the generality of the foregoing, you may not use any part of **Onco-Fusion**, the underlying Content or the output for any other purpose, including: 15 | 16 | 1. use or incorporation into a commercial product or towards the performance of a commercial service; 17 | 2. research use in a commercial setting; 18 | 3. diagnosis, treatment or use for patient care or the provision of medical services; or 19 | 4. generation of reports in a medical, laboratory, hospital or other patient care setting. 20 | 21 | You may not copy, transfer, reproduce, modify, sell, sublicense, distribute or create derivative works of **Onco-Fusion** or the underlying Content for any commercial purpose without the express permission of MSK. Any attempt otherwise to copy, transfer, modify, sublicense or distribute the Content is void, and will automatically terminate your rights under these Terms of Use. 22 | 23 | The output of **Onco-Fusion** and the underlying Content is not a substitute for professional medical help, judgment, or advice. Use of **Onco-Fusion** does not create a physician-patient relationship or in any way make a person a patient of MSK. A physician or other qualified health provider should always be consulted for any health problem or medical condition. 24 | 25 | Neither these Terms of Use nor the availability of **Onco-Fusion** should be understood to create an obligation or expectation that MSK will continue to make **Onco-Fusion** available. MSK may discontinue or restrict the availability of **Onco-Fusion** at any time. MSK may also modify these Terms of Use at any time. 26 | 27 | Any use of the Content is subject to MSK's intellectual property rights, including any granted, pending or filed provisional patent applications, and any other patent, copyright, trademark, trade secret or other intellectual property rights. MSK respects the intellectual property rights of others, just as it expects others to respect its intellectual property. If you believe that any content (including data or information uploaded by you) to the Content or other activity taking place on the website where it is hosted constitutes infringement of a work protected by copyright, please notify us as follows: 28 | 29 | Email **boehmk@mskcc.org**. 30 | 31 | Your notice must comply with the Digital Millennium Copyright Act (17 U.S.C. §512) (the "DMCA"). Upon receipt of a compliant notice, we will respond and proceed in accordance with the DMCA. 32 | 33 | By using **Onco-Fusion**, you consent to the jurisdiction and venue of the state and federal courts located in New York City, New York, USA, for any claims related to or arising from your use of **Onco-Fusion** or your violation of these Terms of Use, and agree that you will not bring any claims against MSK that relate to or arise from the foregoing except in those courts. 34 | 35 | If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of these Terms of Use, they do not excuse you from the conditions of these Terms of Use.
36 | 37 | If any provision of these Terms of Use is held to be invalid or unenforceable, then such provision shall be struck and the remaining provisions shall be enforced. Headings are for reference purposes only and in no way define, limit, construe, or describe the scope or extent of such section. MSK's failure to act with respect to a breach by you or others does not waive its right to act with respect to subsequent or similar breaches. This agreement and the terms and conditions contained herein set forth the entire understanding and agreement between MSK and you with respect to the subject matter hereof and supersede any prior or contemporaneous understanding, whether written or oral. 38 | 39 | Inquiries about the Content should be directed to **boehmk@mskcc.org**. 40 | 41 | If you are interested in using **Onco-Fusion** for purposes beyond those permitted by these Terms of Use, please contact **boehmk@mskcc.org** to inquire concerning the availability of a license. 42 | 43 | #151032716\_v1 44 | -------------------------------------------------------------------------------- /survival-modeling/environment.yml: -------------------------------------------------------------------------------- 1 | name: sklearn 2 | channels: 3 | - conda-forge 4 | - bioconda 5 | - defaults 6 | dependencies: 7 | - affine=2.3.0=py_0 8 | - albumentations=0.5.2=pyhd8ed1ab_0 9 | - aom=3.2.0=he49afe7_2 10 | - astor=0.8.1=pyh9f0ad1d_0 11 | - attrs=21.2.0=pyhd8ed1ab_0 12 | - autograd=1.3=py_0 13 | - autograd-gamma=0.5.0=pyh9f0ad1d_0 14 | - bleach=4.1.0=pyhd8ed1ab_0 15 | - blosc=1.21.0=he49afe7_0 16 | - bokeh=2.4.2=py39h6e9494a_0 17 | - boost-cpp=1.74.0=hff03dee_4 18 | - brotli=1.0.9=h0d85af4_6 19 | - brotli-bin=1.0.9=h0d85af4_6 20 | - brotlipy=0.7.0=py39h89e85a6_1003 21 | - brunsli=0.1=h046ec9c_0 22 | - bzip2=1.0.8=h0d85af4_4 23 | - c-ares=1.18.1=h0d85af4_0 24 | - c-blosc2=2.0.4=ha1a4663_1 25 | - ca-certificates=2021.10.8=h033912b_0 26 | - cairo=1.16.0=he43a7df_1008 27 | - certifi=2021.10.8=py39h6e9494a_1 28 | - cffi=1.15.0=py39he338e87_0 29 | - cfitsio=3.470=h01dc385_7 30 | - chardet=4.0.0=py39h6e9494a_2 31 | - charls=2.2.0=h046ec9c_0 32 | - click=7.1.2=pyh9f0ad1d_0 33 | - click-plugins=1.1.1=py_0 34 | - cligj=0.7.2=pyhd8ed1ab_1 35 | - cloudpickle=2.0.0=pyhd8ed1ab_0 36 | - colorama=0.4.4=pyh9f0ad1d_0 37 | - colorcet=3.0.0=pyhd8ed1ab_0 38 | - cryptography=36.0.0=py39h209aa08_0 39 | - curl=7.80.0=hf45b732_1 40 | - cycler=0.11.0=pyhd8ed1ab_0 41 | - cytoolz=0.11.2=py39h89e85a6_1 42 | - dask-core=2021.12.0=pyhd8ed1ab_0 43 | - et_xmlfile=1.0.1=py_1001 44 | - expat=2.4.1=he49afe7_0 45 | - ffmpeg=4.4.1=h79e7b16_0 46 | - fontconfig=2.13.1=h10f422b_1005 47 | - fonttools=4.28.3=py39h89e85a6_0 48 | - formulaic=0.2.4=pyhd8ed1ab_0 49 | - freetype=2.10.4=h4cff582_1 50 | - freexl=1.0.6=h0d85af4_0 51 | - fsspec=2021.11.1=pyhd8ed1ab_0 52 | - future=0.18.2=py39h6e9494a_4 53 | - geos=3.9.1=he49afe7_2 54 | - geotiff=1.6.0=h26421ea_6 55 | - gettext=0.19.8.1=hd1a6beb_1008 56 | - giflib=5.2.1=hbcb3906_2 57 | - gmp=6.2.1=h2e338ed_0 58 | - gnutls=3.6.13=h756fd2b_1 59 | - graphite2=1.3.13=h2e338ed_1001 60 | - harfbuzz=2.9.1=h159f659_1 61 | - hdf4=4.2.15=hefd3b78_3 62 | - hdf5=1.10.6=nompi_hc5d9132_1114 63 | - holoviews=1.14.7=pyhd8ed1ab_0 64 | - icu=68.2=he49afe7_0 65 | - idna=2.10=pyh9f0ad1d_0 66 | - imagecodecs=2021.8.26=py39he5b32f2_1 67 | - imageio=2.13.3=pyh239f2a4_0 68 | - imgaug=0.4.0=py_1 69 | - importlib-metadata=4.10.1=py39h6e9494a_0 70 | - interface_meta=1.2.4=pyhd8ed1ab_0 71 | - jasper=1.900.1=h636a363_1006 72 | - 
jbig=2.1=h0d85af4_2003 73 | - jdcal=1.4.1=py_0 74 | - jinja2=3.0.3=pyhd8ed1ab_0 75 | - joblib=1.1.0=pyhd8ed1ab_0 76 | - jpeg=9d=hbcb3906_0 77 | - json-c=0.15=hcb556a6_0 78 | - jxrlib=1.1=h35c211d_2 79 | - kealib=1.4.14=h31dd65d_2 80 | - kiwisolver=1.3.2=py39hf018cea_1 81 | - krb5=1.19.2=hcfbf3a7_3 82 | - lame=3.100=h35c211d_1001 83 | - lcms2=2.12=h577c468_0 84 | - lerc=3.0=he49afe7_0 85 | - libaec=1.0.6=he49afe7_0 86 | - libblas=3.9.0=12_osx64_openblas 87 | - libbrotlicommon=1.0.9=h0d85af4_6 88 | - libbrotlidec=1.0.9=h0d85af4_6 89 | - libbrotlienc=1.0.9=h0d85af4_6 90 | - libcblas=3.9.0=12_osx64_openblas 91 | - libcurl=7.80.0=hf45b732_1 92 | - libcxx=12.0.1=habf9029_0 93 | - libdap4=3.20.6=h3e144a0_2 94 | - libdeflate=1.8=h0d85af4_0 95 | - libedit=3.1.20191231=h0678c8f_2 96 | - libev=4.33=haf1e3a3_1 97 | - libffi=3.4.2=h0d85af4_5 98 | - libgdal=3.1.4=h85d2021_18 99 | - libgfortran=5.0.0=9_3_0_h6c81a4c_23 100 | - libgfortran5=9.3.0=h6c81a4c_23 101 | - libglib=2.70.2=hf1fb8c0_0 102 | - libiconv=1.16=haf1e3a3_0 103 | - libkml=1.3.0=h8fd9edb_1014 104 | - liblapack=3.9.0=12_osx64_openblas 105 | - liblapacke=3.9.0=12_osx64_openblas 106 | - libnetcdf=4.8.1=nompi_hb4d10b0_100 107 | - libnghttp2=1.43.0=h6f36284_1 108 | - libopenblas=0.3.18=openmp_h3351f45_0 109 | - libopencv=4.5.3=py39h852ad08_1 110 | - libpng=1.6.37=h7cec526_2 111 | - libpq=13.5=hea3049e_1 112 | - libprotobuf=3.16.0=hcf210ce_0 113 | - librttopo=1.1.0=h5413771_6 114 | - libspatialite=5.0.1=h035f608_6 115 | - libssh2=1.10.0=h52ee1ee_2 116 | - libtiff=4.3.0=hd146c10_2 117 | - libvpx=1.11.0=he49afe7_3 118 | - libwebp-base=1.2.1=h0d85af4_0 119 | - libxml2=2.9.12=h93ec3fd_0 120 | - libzip=1.8.0=h8b0c345_1 121 | - libzlib=1.2.11=h9173be1_1013 122 | - libzopfli=1.0.3=h046ec9c_0 123 | - lifelines=0.26.3=pyhd8ed1ab_0 124 | - llvm-openmp=12.0.1=hda6cdc1_1 125 | - locket=0.2.0=py_2 126 | - lz4-c=1.9.3=he49afe7_1 127 | - markdown=3.3.6=pyhd8ed1ab_0 128 | - markupsafe=2.0.1=py39h89e85a6_1 129 | - matplotlib-base=3.5.0=py39hb07454d_0 130 | - munkres=1.1.4=pyh9f0ad1d_0 131 | - ncurses=6.2=h2e338ed_4 132 | - nettle=3.6=hedd7734_0 133 | - networkx=2.6.3=pyhd8ed1ab_1 134 | - numpy=1.21.4=py39h7eed0ac_0 135 | - olefile=0.46=pyh9f0ad1d_1 136 | - opencv=4.5.3=py39h6e9494a_1 137 | - openh264=2.1.1=hfd3ada9_0 138 | - openjpeg=2.4.0=h6e7aa92_1 139 | - openpyxl=3.0.6=pyhd8ed1ab_0 140 | - openssl=1.1.1l=h0d85af4_0 141 | - packaging=21.3=pyhd8ed1ab_0 142 | - pandas=1.3.4=py39h4d6be9b_1 143 | - panel=0.12.6=pyhd8ed1ab_0 144 | - param=1.12.0=pyh6c4a22f_0 145 | - partd=1.2.0=pyhd8ed1ab_0 146 | - patsy=0.5.2=pyhd8ed1ab_0 147 | - pcre=8.45=he49afe7_0 148 | - pillow=8.4.0=py39he9bb72f_0 149 | - pip=21.3.1=pyhd8ed1ab_0 150 | - pixman=0.40.0=hbcb3906_0 151 | - poppler=21.03.0=h640f9a4_0 152 | - poppler-data=0.4.11=hd8ed1ab_0 153 | - postgresql=13.5=he8fe76e_1 154 | - proj=8.0.1=h1512c50_0 155 | - py-opencv=4.5.3=py39h71a6800_1 156 | - pycparser=2.21=pyhd8ed1ab_0 157 | - pyct=0.4.6=py_0 158 | - pyct-core=0.4.6=py_0 159 | - pyopenssl=21.0.0=pyhd8ed1ab_0 160 | - pyparsing=3.0.6=pyhd8ed1ab_0 161 | - pysocks=1.7.1=py39h6e9494a_4 162 | - python=3.9.7=h1248fe1_3_cpython 163 | - python-dateutil=2.8.2=pyhd8ed1ab_0 164 | - python_abi=3.9=2_cp39 165 | - pytz=2021.3=pyhd8ed1ab_0 166 | - pyviz_comms=2.1.0=pyhd8ed1ab_0 167 | - pywavelets=1.2.0=py39hc89836e_1 168 | - pyyaml=6.0=py39h89e85a6_3 169 | - rasterio=1.2.0=py39h2b252dd_0 170 | - readline=8.1=h05e3726_0 171 | - requests=2.25.1=pyhd3deb0d_0 172 | - scikit-image=0.19.0=py39h4d6be9b_0 173 | - scikit-learn=1.0.1=py39hd4eea88_2 
174 | - scipy=1.7.3=py39h056f1c0_0 175 | - seaborn=0.11.2=hd8ed1ab_0 176 | - seaborn-base=0.11.2=pyhd8ed1ab_0 177 | - setuptools=59.4.0=py39h6e9494a_0 178 | - shapely=1.8.0=py39h1d9c377_0 179 | - six=1.16.0=pyh6c4a22f_0 180 | - snappy=1.1.8=hb1e8313_3 181 | - snuggs=1.4.7=py_0 182 | - sqlite=3.37.0=h23a322b_0 183 | - statsmodels=0.13.1=py39hc89836e_0 184 | - svt-av1=0.8.7=he49afe7_1 185 | - threadpoolctl=3.0.0=pyh8a188c0_0 186 | - tifffile=2021.11.2=pyhd8ed1ab_0 187 | - tiledb=2.3.4=h8370e7a_0 188 | - tk=8.6.11=h5dbffcc_1 189 | - toolz=0.11.2=pyhd8ed1ab_0 190 | - tornado=6.1=py39h89e85a6_2 191 | - tqdm=4.62.3=pyhd8ed1ab_0 192 | - typing_extensions=4.0.1=pyha770c72_0 193 | - tzcode=2021e=h0d85af4_0 194 | - tzdata=2021e=he74cb21_0 195 | - urllib3=1.26.7=pyhd8ed1ab_0 196 | - webencodings=0.5.1=py_1 197 | - wheel=0.37.0=pyhd8ed1ab_1 198 | - wrapt=1.13.3=py39h89e85a6_1 199 | - x264=1!161.3030=h0d85af4_1 200 | - x265=3.5=h940c156_1 201 | - xerces-c=3.2.3=h379762d_3 202 | - xlrd=2.0.1=pyhd8ed1ab_3 203 | - xz=5.2.5=haf1e3a3_1 204 | - yaml=0.2.5=haf1e3a3_0 205 | - zfp=0.5.5=h4a89273_8 206 | - zipp=3.7.0=pyhd8ed1ab_1 207 | - zlib=1.2.11=h9173be1_1013 208 | - zstd=1.5.0=h582d3a0_0 209 | - pip: 210 | - factor-analyzer==0.3.1 211 | - littleutils==0.2.2 212 | - matplotlib-venn==0.11.6 213 | - outdated==0.2.0 214 | - pandas-flavor==0.2.0 215 | - pingouin==0.3.11 216 | - statannot==0.2.3 217 | - tabulate==0.8.7 218 | - varclushi==0.1.0 219 | - xarray==0.16.2 220 | prefix: /Users/boehmk/anaconda3/envs/sklearn 221 | -------------------------------------------------------------------------------- /survival-modeling/figures/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/.gitkeep -------------------------------------------------------------------------------- /survival-modeling/figures/barplots/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/barplots/.gitkeep -------------------------------------------------------------------------------- /survival-modeling/figures/crs_plots/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/crs_plots/.gitkeep -------------------------------------------------------------------------------- /survival-modeling/figures/feature_plots/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/feature_plots/.gitkeep -------------------------------------------------------------------------------- /survival-modeling/figures/forest_plots/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/forest_plots/.gitkeep -------------------------------------------------------------------------------- /survival-modeling/figures/km_plots/.gitkeep: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/km_plots/.gitkeep -------------------------------------------------------------------------------- /survival-modeling/figures/multimodal/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/multimodal/.gitkeep -------------------------------------------------------------------------------- /survival-modeling/results/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/results/.gitkeep -------------------------------------------------------------------------------- /survival-modeling/results/crs/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/results/crs/.gitkeep -------------------------------------------------------------------------------- /survival-modeling/results/model_summaries/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/results/model_summaries/.gitkeep -------------------------------------------------------------------------------- /survival-modeling/utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import yaml 4 | import os 5 | 6 | with open('../global_config.yaml', 'r') as f: 7 | CONFIGS = yaml.safe_load(f) 8 | DATA_DIR = CONFIGS['data_dir'] 9 | CODE_DIR = CONFIGS['code_dir'] 10 | 11 | 12 | def load_crs(binarize=False, drop_net=False): 13 | df = pd.read_csv(os.path.join(DATA_DIR, 'data', 'dataframes', 'crs_df.csv')) 14 | df['Patient ID'] = df['Patient ID'].astype(str).apply(lambda x: x.zfill(3)) 15 | df = df.set_index('Patient ID') 16 | if drop_net: 17 | df = df[df.CRS != 'NET'] 18 | if binarize: 19 | df.loc[df.CRS=='2', 'CRS'] = '1/2' 20 | df.loc[df.CRS=='1', 'CRS'] = '1/2' 21 | df.loc[df.CRS=='3', 'CRS'] = '3/NET' 22 | df.loc[df.CRS=='NET', 'CRS'] = '3/NET' 23 | df.loc[df.CRS.str.contains('1'), 'CRS'] = '1/2' 24 | df['CRS'] = df['CRS'].astype(str) 25 | return df 26 | 27 | 28 | def load_os(): 29 | df = pd.read_csv(os.path.join(DATA_DIR, 'data', 'dataframes', 'clin_df.csv')) 30 | df['Patient ID'] = df['Patient ID'].astype(str) 31 | df = df.set_index('Patient ID') 32 | df = df[['duration.OS', 'observed.OS']] 33 | df = df.rename(columns={'duration.OS': 'duration', 34 | 'observed.OS': 'observed'}) 35 | return df 36 | 37 | def load_pfs(): 38 | df = pd.read_csv(os.path.join(DATA_DIR, 'data', 'dataframes', 'clin_df.csv')) 39 | df['Patient ID'] = df['Patient ID'].astype(str) 40 | df = df.set_index('Patient ID') 41 | df = df[['duration.PFS', 'observed.PFS']] 42 | df = df.rename(columns={'duration.PFS': 'duration', 43 | 'observed.PFS': 'observed'}) 44 | return df 45 | 46 | def load_clin(cols=['Complete gross resection', 'stage', 'age', 'Type of surgery', 'adnexal_lesion', 'omental_lesion', 'Received PARPi']): 47 | df = pd.read_csv(os.path.join(DATA_DIR, 'data', 'dataframes', 'clin_df.csv')) 48 | df['Patient ID'] = df['Patient ID'].astype(str) 49 | df = 
df.set_index('Patient ID') 50 | df = df[cols] 51 | return df 52 | 53 | def load_pathomic_features(): 54 | df = pd.read_csv(os.path.join(CODE_DIR, 'code', 'hne-feature-extraction', 'tissue_tile_features', 'reference_hne_features.csv')) 55 | df['Patient ID'] = df['Patient ID'].astype(str) 56 | df = df.set_index('Patient ID') 57 | return df 58 | 59 | 60 | def load_radiomic_features(site='omentum'): 61 | if site == 'omentum': 62 | df = pd.read_csv(os.path.join(CODE_DIR, 'code', 'ct-feature-extraction', 'features', 'ct_features_omentum.csv')) 63 | elif site == 'ovary': 64 | df = pd.read_csv(os.path.join(CODE_DIR, 'code', 'ct-feature-extraction', 'features', 'ct_features_ovary.csv')) 65 | else: 66 | raise NotImplementedError("Unknown radiomic site: {}".format(site)) 67 | df['Patient ID'] = df['Patient ID'].astype(str) 68 | df = df.set_index('Patient ID') 69 | return df 70 | 71 | 72 | 73 | def load_all_ids(imaging_only=False): 74 | radiomic_ids = set(load_radiomic_features('omentum').index).union(set(load_radiomic_features('ovary').index)) 75 | pathomic_ids = set(load_pathomic_features().index) 76 | if imaging_only: 77 | ids = pathomic_ids.union(radiomic_ids) 78 | else: 79 | clinical_ids = set(load_clin().index) 80 | genomic_ids = set(load_genom().index) 81 | ids = pathomic_ids.union(radiomic_ids).union(genomic_ids).union(clinical_ids) 82 | return ids 83 | 84 | 85 | def load_genom(): 86 | genom = pd.read_csv(os.path.join(DATA_DIR, 'data', 'dataframes', 'genomic_df.csv')) 87 | genom['Patient ID'] = genom['Patient ID'].astype(str) 88 | genom = genom.set_index('Patient ID') 89 | genom.loc[genom['HRD status'] == 'HRP', 'hrd_status'] = False 90 | genom.loc[genom['HRD status'] == 'HRD', 'hrd_status'] = True 91 | genom = genom.dropna(subset=['HRD status']) 92 | genom.hrd_status = genom.hrd_status.astype(bool) 93 | genom = genom[['hrd_status']] 94 | 95 | return genom 96 | -------------------------------------------------------------------------------- /tissue-type-training/checkpoints/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/checkpoints/.gitkeep -------------------------------------------------------------------------------- /tissue-type-training/checkpoints/tissue_type_classifier_weights.torch: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/checkpoints/tissue_type_classifier_weights.torch -------------------------------------------------------------------------------- /tissue-type-training/config.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from torch import device 3 | import yaml 4 | import os 5 | 6 | 7 | with open('../global_config.yaml', 'r') as f: 8 | CONFIGS = yaml.safe_load(f) 9 | DATA_DIR = CONFIGS['data_dir'] 10 | 11 | parser = argparse.ArgumentParser() 12 | 13 | parser.add_argument('--preprocessed_cohort_csv_path', 14 | type=str, 15 | default='preprocessed_msk_os_h&e_cohort.csv', 16 | help='Full path to CSV file describing whole slide images and outcomes.') 17 | 18 | parser.add_argument('--checkpoint_path', 19 | type=str, 20 | default='', 21 | help='Location of checkpoint to load for preds OR to resume for training.') 22 | 23 | parser.add_argument('--experiment_name', 24 | type=str, 25 | default='EXPERIMENT_NAME', 26 | 
help='name under which to store checkpoints') 27 | 28 | parser.add_argument('--crossval', 29 | type=int, 30 | default=0, 31 | help='0: no xval, >0: k-fold xval') 32 | 33 | parser.add_argument('--normalize', 34 | action='store_true', 35 | default=False, 36 | help='whether to apply macenko normalization') 37 | 38 | parser.add_argument('--cohort_csv_path', 39 | type=str, 40 | default='msk_os_h&e_cohort.csv', 41 | help='Full path to CSV file describing whole slide images and outcomes.') 42 | 43 | parser.add_argument('--tile_size', 44 | type=int, 45 | default=512, 46 | help='Edge length of each tile used for training/evaluation.') 47 | 48 | parser.add_argument('--tile_selection_type', 49 | type=str, 50 | default='manual', 51 | help='manual or otsu') 52 | 53 | parser.add_argument('--otsu_threshold', 54 | type=float, 55 | default=0.25, 56 | help='Percentage foreground required to include tile.') 57 | 58 | parser.add_argument('--purple_threshold', 59 | type=float, 60 | default=0.25, 61 | help='Percentage purple required to include tile.') 62 | 63 | parser.add_argument('--magnification', 64 | type=int, 65 | default=20, 66 | help='Magnification of WSI.') 67 | 68 | parser.add_argument('--model', 69 | type=str, 70 | default='resnet18', 71 | help='cnn architecture') 72 | 73 | parser.add_argument('--overlap', 74 | type=int, 75 | default=0, 76 | help='n pixels of tile overlap (for preprocess and pretile)') 77 | 78 | parser.add_argument('--batch_size', 79 | type=int, 80 | default=96, 81 | help='Batch size for inference or training.') 82 | 83 | parser.add_argument('--gpu', 84 | type=int, 85 | default=[0], 86 | nargs='+', 87 | help='Which GPU(s) to use for training.') 88 | 89 | parser.add_argument('--learning_rate', 90 | type=float, 91 | default=0.001, 92 | help='Learning rate for training.') 93 | 94 | parser.add_argument('--weight_decay', 95 | type=float, 96 | default=1e-4, 97 | help='Weight decay for Adam optimizer.') 98 | 99 | parser.add_argument('--num_epochs', 100 | type=int, 101 | default=20, 102 | help='Number of epochs to train.') 103 | 104 | parser.add_argument('--min_n_tiles', 105 | type=int, 106 | default=100, 107 | help='Minimum number of tiles per slide.') 108 | 109 | parser.add_argument('--tile_dir', 110 | type=str, 111 | default='pretilings_512', 112 | help='Directory to store/load tiles..') 113 | 114 | parser.add_argument('--val_pred_file', 115 | type=str, 116 | default='', 117 | help='Location of val pred .csv to load for eval.') 118 | 119 | 120 | args = parser.parse_args() 121 | 122 | args.preprocessed_cohort_csv_path = os.path.join(DATA_DIR, args.preprocessed_cohort_csv_path) 123 | args.cohort_csv_path = os.path.join(DATA_DIR, args.cohort_csv_path) 124 | 125 | desired_otsu_thumbnail_tile_size = 4 126 | -------------------------------------------------------------------------------- /tissue-type-training/confusion_matrix_analysis.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from matplotlib import pyplot as plt 4 | from sklearn.metrics import confusion_matrix 5 | import seaborn as sns 6 | from scipy.stats import binom_test 7 | 8 | 9 | FONTSIZE = 8 10 | plt.rc('legend',fontsize=FONTSIZE, title_fontsize=FONTSIZE) 11 | plt.rc('xtick',labelsize=FONTSIZE) 12 | plt.rc('ytick',labelsize=FONTSIZE) 13 | plt.rc("axes", labelsize=FONTSIZE) 14 | 15 | 16 | 17 | def cm_analysis(y_true, y_pred, labels=None, ymap=None, figsize=(2.66,2.66), filename='confusion_matrix.pdf', acc_pval=False): 18 | """ 19 | Generate 
matrix plot of confusion matrix with pretty annotations. 20 | The plot image is saved to disk. 21 | args: 22 | y_true: true label of the data, with shape (nsamples,) 23 | y_pred: prediction of the data, with shape (nsamples,) 24 | filename: filename of figure file to save 25 | labels: string array, name the order of class labels in the confusion matrix. 26 | use `clf.classes_` if using scikit-learn models. 27 | with shape (nclass,). 28 | ymap: dict: any -> string, length == nclass. 29 | if not None, map the labels & ys to more understandable strings. 30 | Caution: original y_true, y_pred and labels must align. 31 | figsize: the size of the figure plotted. 32 | """ 33 | if ymap is not None: 34 | y_pred = [ymap[int(yi)] for yi in y_pred] 35 | y_true = [ymap[int(yi)] for yi in y_true] 36 | cm = confusion_matrix(y_true, y_pred, labels=labels) 37 | cm_sum = np.sum(cm, axis=1, keepdims=True) 38 | cm_perc = cm / cm_sum.astype(float) * 100 39 | annot = np.empty_like(cm).astype(str) 40 | nrows, ncols = cm.shape 41 | for i in range(nrows): 42 | for j in range(ncols): 43 | c = cm[i, j] 44 | p = cm_perc[i, j] 45 | if i == j: 46 | s = cm_sum[i] 47 | # annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s) 48 | annot[i, j] = '{:.0f}%'.format(p) 49 | elif c == 0: 50 | annot[i, j] = '' 51 | else: 52 | # annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s) 53 | annot[i, j] = '{:.0f}%'.format(p) 54 | 55 | cm = pd.DataFrame(cm_perc, index=labels, columns=labels) 56 | cm.index.name = 'Actual' 57 | cm.columns.name = 'Predicted' 58 | fig, ax = plt.subplots(figsize=figsize, constrained_layout=True) 59 | # sns.set(font_scale=1.6) 60 | sns.heatmap(cm, annot=annot, fmt='', ax=ax, square=True, cbar=False, annot_kws={'fontsize': FONTSIZE}) 61 | if acc_pval: 62 | y_pred = np.array(y_pred) 63 | y_true = np.array(y_true) 64 | _, counts = np.unique(y_true, return_counts=True) 65 | no_information_rate = np.max(counts) / float(len(y_true)) 66 | plt.text(0, 0, 'p = {:4.3f}'.format(binom_test(x=np.sum(y_pred == y_true), 67 | n=len(y_pred), 68 | p=no_information_rate))) 69 | plt.savefig(filename, dpi=300 if '.svg' not in filename else None) 70 | -------------------------------------------------------------------------------- /tissue-type-training/cross_validate_on_annotations.sh: -------------------------------------------------------------------------------- 1 | ARGS="--magnification 20 2 | --cohort_csv_path data/dataframes/tissuetype_hne_df.csv 3 | --preprocessed_cohort_csv_path data/dataframes/preprocessed_tissuetype_hne_df.csv 4 | --tile_dir pretilings 5 | --tile_selection_type manual 6 | --otsu_threshold 0.2 7 | --batch_size 96 8 | --min_n_tiles 1 9 | --num_epochs 30 10 | --crossval 4 11 | --gpu 0 12 | --overlap 32 13 | --tile_size 64 14 | --normalize 15 | --model resnet18 16 | --experiment_name xval4 17 | --learning_rate 0.0005" 18 | 19 | python preprocess.py ${ARGS} 20 | 21 | python pretile.py ${ARGS} 22 | 23 | python train_tissue_tile_clf.py ${ARGS} 24 | 25 | python pred_tissue_tile.py ${ARGS} --checkpoint_path "checkpoints/xval4_fold0_epoch021.torch" --val_pred_file "predictions/xval4_fold0_epoch021.torch_val.csv" 26 | python pred_tissue_tile.py ${ARGS} --checkpoint_path "checkpoints/xval4_fold1_epoch021.torch" --val_pred_file "predictions/xval4_fold1_epoch021.torch_val.csv" 27 | python pred_tissue_tile.py ${ARGS} --checkpoint_path "checkpoints/xval4_fold2_epoch021.torch" --val_pred_file "predictions/xval4_fold2_epoch021.torch_val.csv" 28 | python pred_tissue_tile.py ${ARGS} --checkpoint_path "checkpoints/xval4_fold3_epoch021.torch" 
--val_pred_file "predictions/xval4_fold3_epoch021.torch_val.csv" 29 | 30 | python eval_tissue_tile.py ${ARGS} --val_pred_file "predictions/xval4_fold{}_epoch021.torch_val.csv" 31 | -------------------------------------------------------------------------------- /tissue-type-training/dataset.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import general_utils 3 | import torch 4 | 5 | from torch.utils.data import Dataset 6 | from PIL import Image, ImageFile 7 | 8 | 9 | ImageFile.LOAD_TRUNCATED_IMAGES = True 10 | 11 | 12 | class TissueTileDataset(Dataset): 13 | def __init__(self, df, tile_dir, transforms=None): 14 | self.df = df 15 | self.tile_dir = tile_dir 16 | self.transforms = transforms 17 | 18 | def __len__(self): 19 | return len(self.df) 20 | 21 | def __getitem__(self, item): 22 | row = self.df.iloc[item] 23 | tile = Image.open(row['tile_file_name']) 24 | 25 | if self.transforms: 26 | tile = self.transforms(tile) 27 | 28 | label = float(row['tile_class']) 29 | 30 | return '/'.join(row['tile_file_name'].split('/')[-2:]), tile.float(), label 31 | -------------------------------------------------------------------------------- /tissue-type-training/environment.yml: -------------------------------------------------------------------------------- 1 | name: transformer 2 | channels: 3 | - pytorch 4 | - conda-forge 5 | - defaults 6 | dependencies: 7 | - _libgcc_mutex=0.1=main 8 | - _openmp_mutex=4.5=1_gnu 9 | - astor=0.8.1=py39h06a4308_0 10 | - autograd=1.3=pyhd3eb1b0_1 11 | - autograd-gamma=0.5.0=pyh9f0ad1d_0 12 | - blas=1.0=mkl 13 | - blosc=1.21.0=h8c45485_0 14 | - bottleneck=1.3.2=py39hdd57654_1 15 | - brotli=1.0.9=he6710b0_2 16 | - brunsli=0.1=h2531618_0 17 | - bzip2=1.0.8=h7b6447c_0 18 | - c-ares=1.18.1=h7f8727e_0 19 | - ca-certificates=2022.2.1=h06a4308_0 20 | - cairo=1.16.0=hf32fb01_1 21 | - certifi=2021.10.8=py39h06a4308_2 22 | - cfitsio=3.470=hf0d0db6_6 23 | - charls=2.2.0=h2531618_0 24 | - cloudpickle=2.0.0=pyhd3eb1b0_0 25 | - cudatoolkit=10.2.89=hfd86e86_1 26 | - cycler=0.11.0=pyhd3eb1b0_0 27 | - cytoolz=0.11.0=py39h27cfd23_0 28 | - dask-core=2021.10.0=pyhd3eb1b0_0 29 | - ffmpeg=4.2.2=h20bf706_0 30 | - fontconfig=2.13.1=h6c09931_0 31 | - fonttools=4.25.0=pyhd3eb1b0_0 32 | - formulaic=0.2.4=pyhd8ed1ab_0 33 | - freetype=2.11.0=h70c0345_0 34 | - fsspec=2022.1.0=pyhd3eb1b0_0 35 | - future=0.18.2=py39h06a4308_1 36 | - gdk-pixbuf=2.42.6=h8cc273a_5 37 | - geos=3.8.0=he6710b0_0 38 | - giflib=5.2.1=h7b6447c_0 39 | - glib=2.69.1=h4ff587b_1 40 | - gmp=6.2.1=h2531618_2 41 | - gnutls=3.6.15=he1e5248_0 42 | - gobject-introspection=1.68.0=py39he41a700_3 43 | - icu=58.2=he6710b0_3 44 | - imagecodecs=2021.8.26=py39h4cda21f_0 45 | - imageio=2.9.0=pyhd3eb1b0_0 46 | - intel-openmp=2021.4.0=h06a4308_3561 47 | - interface_meta=1.2.4=pyhd8ed1ab_0 48 | - joblib=1.1.0=pyhd3eb1b0_0 49 | - jpeg=9d=h7f8727e_0 50 | - jxrlib=1.1=h7b6447c_2 51 | - kiwisolver=1.3.2=py39h295c915_0 52 | - krb5=1.19.2=hac12032_0 53 | - lame=3.100=h7b6447c_0 54 | - lcms2=2.12=h3be6417_0 55 | - ld_impl_linux-64=2.35.1=h7274673_9 56 | - lerc=3.0=h295c915_0 57 | - libaec=1.0.4=he6710b0_1 58 | - libcurl=7.80.0=h0b77cf5_0 59 | - libdeflate=1.8=h7f8727e_5 60 | - libedit=3.1.20210910=h7f8727e_0 61 | - libev=4.33=h7f8727e_1 62 | - libffi=3.3=he6710b0_2 63 | - libgcc-ng=9.3.0=h5101ec6_17 64 | - libgfortran-ng=7.5.0=ha8ba4b0_17 65 | - libgfortran4=7.5.0=ha8ba4b0_17 66 | - libgomp=9.3.0=h5101ec6_17 67 | - libidn2=2.3.2=h7f8727e_0 68 | - libnghttp2=1.46.0=hce63b2e_0 69 | - 
libopus=1.3.1=h7b6447c_0 70 | - libpng=1.6.37=hbc83047_0 71 | - libssh2=1.9.0=h1ba5d50_1 72 | - libstdcxx-ng=9.3.0=hd4cf53a_17 73 | - libtasn1=4.16.0=h27cfd23_0 74 | - libtiff=4.2.0=h85742a9_0 75 | - libunistring=0.9.10=h27cfd23_0 76 | - libuuid=1.0.3=h7f8727e_2 77 | - libuv=1.40.0=h7b6447c_0 78 | - libvpx=1.7.0=h439df22_0 79 | - libwebp=1.2.2=h55f646e_0 80 | - libwebp-base=1.2.2=h7f8727e_0 81 | - libxcb=1.14=h7b6447c_0 82 | - libxml2=2.9.12=h03d6c58_0 83 | - libzopfli=1.0.3=he6710b0_0 84 | - lifelines=0.26.5=pyhd8ed1ab_0 85 | - locket=0.2.1=py39h06a4308_1 86 | - lz4-c=1.9.3=h295c915_1 87 | - matplotlib-base=3.5.1=py39ha18d171_0 88 | - mkl=2021.4.0=h06a4308_640 89 | - mkl-service=2.4.0=py39h7f8727e_0 90 | - mkl_fft=1.3.1=py39hd3c417c_0 91 | - mkl_random=1.2.2=py39h51133e4_0 92 | - munkres=1.1.4=py_0 93 | - ncurses=6.3=h7f8727e_2 94 | - nettle=3.7.3=hbbd107a_1 95 | - networkx=2.6.3=pyhd3eb1b0_0 96 | - numexpr=2.8.1=py39h6abb31d_0 97 | - numpy=1.21.2=py39h20f2e39_0 98 | - numpy-base=1.21.2=py39h79a1101_0 99 | - openh264=2.1.1=h4ff587b_0 100 | - openjpeg=2.4.0=h3ad879b_0 101 | - openslide=3.4.1=h8137273_1 102 | - openslide-python=1.1.2=py39h3811e60_0 103 | - openssl=1.1.1m=h7f8727e_0 104 | - packaging=21.3=pyhd3eb1b0_0 105 | - pandas=1.3.5=py39h8c16a72_0 106 | - partd=1.2.0=pyhd3eb1b0_0 107 | - patsy=0.5.2=py39h06a4308_1 108 | - pcre=8.45=h295c915_0 109 | - pillow=9.0.1=py39h22f2fdc_0 110 | - pip=21.2.4=py39h06a4308_0 111 | - pixman=0.40.0=h7f8727e_1 112 | - pyparsing=3.0.4=pyhd3eb1b0_0 113 | - python=3.9.7=h12debd9_1 114 | - python-dateutil=2.8.2=pyhd3eb1b0_0 115 | - python_abi=3.9=2_cp39 116 | - pytorch=1.10.1=py3.9_cuda10.2_cudnn7.6.5_0 117 | - pytorch-mutex=1.0=cuda 118 | - pytz=2021.3=pyhd3eb1b0_0 119 | - pywavelets=1.1.1=py39h6323ea4_4 120 | - pyyaml=6.0=py39h7f8727e_1 121 | - readline=8.1.2=h7f8727e_1 122 | - scikit-image=0.18.3=py39h51133e4_0 123 | - scikit-learn=1.0.2=py39h51133e4_0 124 | - scipy=1.7.3=py39hc147768_0 125 | - seaborn=0.11.2=hd8ed1ab_0 126 | - seaborn-base=0.11.2=pyhd8ed1ab_0 127 | - setuptools=58.0.4=py39h06a4308_0 128 | - shapely=1.7.1=py39h1728cc4_0 129 | - six=1.16.0=pyhd3eb1b0_1 130 | - snappy=1.1.8=he6710b0_0 131 | - sqlite=3.37.2=hc218d9a_0 132 | - statsmodels=0.13.0=py39h7f8727e_0 133 | - threadpoolctl=2.2.0=pyh0d69192_0 134 | - tifffile=2021.7.2=pyhd3eb1b0_2 135 | - tk=8.6.11=h1ccaba5_0 136 | - toolz=0.11.2=pyhd3eb1b0_0 137 | - torchaudio=0.10.1=py39_cu102 138 | - torchvision=0.11.2=py39_cu102 139 | - typing_extensions=3.10.0.2=pyh06a4308_0 140 | - tzdata=2021e=hda174b7_0 141 | - wheel=0.37.1=pyhd3eb1b0_0 142 | - wrapt=1.13.3=py39h7f8727e_2 143 | - x264=1!157.20191217=h7b6447c_0 144 | - xz=5.2.5=h7b6447c_0 145 | - yaml=0.2.5=h7b6447c_0 146 | - zfp=0.5.5=h295c915_6 147 | - zlib=1.2.11=h7f8727e_4 148 | - zstd=1.4.9=haebb681_0 149 | - pip: 150 | - argparse==1.4.0 151 | - einops==0.3.2 152 | - nystrom-attention==0.0.11 153 | -------------------------------------------------------------------------------- /tissue-type-training/eval_tissue_tile.py: -------------------------------------------------------------------------------- 1 | from sklearn.metrics import average_precision_score, accuracy_score, balanced_accuracy_score, confusion_matrix 2 | from sklearn.utils.class_weight import compute_class_weight 3 | import pandas as pd 4 | import config 5 | import json 6 | from openslide import OpenSlide 7 | from openslide.lowlevel import OpenSlideUnsupportedFormatError 8 | from PIL import Image 9 | import os 10 | import general_utils 11 | import numpy as np 12 | import 
matplotlib.pyplot as plt 13 | from joblib import Parallel, delayed 14 | from confusion_matrix_analysis import cm_analysis 15 | from general_utils import label_image_tissue_type, add_scale_bar 16 | 17 | 18 | FONTSIZE = 7 19 | plt.rc('legend',fontsize=FONTSIZE, title_fontsize=FONTSIZE) 20 | plt.rc('xtick',labelsize=FONTSIZE) 21 | plt.rc('ytick',labelsize=FONTSIZE) 22 | plt.rc("axes", labelsize=FONTSIZE) 23 | 24 | 25 | def get_auprc(df): 26 | try: 27 | assert 'score_1' in df.columns 28 | assert 'label' in df.columns 29 | except AssertionError: 30 | raise AssertionError('label and score_1 not in {}'.format(df.columns)) 31 | preds = df['score_1'].tolist() 32 | is_hrd = df['label'].tolist() 33 | 34 | auprc = average_precision_score(y_true=is_hrd, y_score=preds) 35 | return {'auprc': auprc} 36 | 37 | 38 | def get_random_auprc_from_df(df): 39 | random_df = df.copy(deep=True) 40 | random_df['score_1'] = np.random.permutation(random_df['score_1'].values) # permute the score column that get_auprc reads 41 | return {'random_auprc': get_auprc(random_df)['auprc']} 42 | 43 | 44 | def get_accuracy(df): 45 | return {'accuracy': accuracy_score(y_true=df.label, y_pred=df.predicted_class)} 46 | 47 | 48 | def get_all_single_class_auprc_values(df): 49 | d = {} 50 | for class_ in df.label.unique(): 51 | class_ = int(class_) 52 | #print('class {}'.format(class_)) 53 | is_truly_class = df['label'] == class_ 54 | df.loc[is_truly_class, 'temp_binary_truth'] = 1 55 | df.loc[~is_truly_class, 'temp_binary_truth'] = 0 56 | 57 | df['temp_binary_pred'] = df['score_{}'.format(class_)] 58 | d['auprc_{}'.format(class_)] = average_precision_score(y_true=df['temp_binary_truth'], 59 | y_score=df['temp_binary_pred']) 60 | df.drop(columns=['temp_binary_pred', 'temp_binary_truth'], inplace=True) 61 | return d 62 | 63 | 64 | def get_confusion_matrix(df): 65 | raise NotImplementedError # intentionally disabled: cm_analysis() is used to render confusion matrices instead 66 | return {'confusion_matrix': confusion_matrix(y_true=df.label, y_pred=df.predicted_class)} 67 | 68 | 69 | def visualize(df, pred_file): 70 | sub_dir = os.path.join('visualizations', pred_file.split('/')[-1].replace('.csv', '')) 71 | if not os.path.exists(sub_dir): 72 | os.mkdir(sub_dir) 73 | 74 | if len(df) == 0: 75 | return 76 | 77 | for image_path, sub_df in df.groupby('image_path'): 78 | _visualize(image_path, sub_df, sub_dir) 79 | 80 | 81 | def _visualize(image_path, sub_df, sub_dir): 82 | desired_otsu_thumbnail_tile_size = 8 # 16 83 | scale_factor = config.args.tile_size / desired_otsu_thumbnail_tile_size 84 | print(image_path) 85 | try: 86 | slide = OpenSlide(image_path) 87 | except OpenSlideUnsupportedFormatError: 88 | print(image_path) 89 | exit() 90 | slide_mag = general_utils.get_magnification(slide) 91 | if slide_mag != config.args.magnification: 92 | if (slide_mag / config.args.magnification) == 2: 93 | scale = scale_factor * 2 94 | elif (slide_mag / config.args.magnification) == 4: 95 | scale = scale_factor * 4 96 | else: 97 | raise AssertionError('Invalid scale') 98 | else: 99 | scale = scale_factor 100 | thumbnail = general_utils.get_downscaled_thumbnail(slide, scale) 101 | sub_df['address'] = sub_df['tile_file_name'].apply( 102 | lambda x: [int(y) for y in x.replace('.png', '').split('/')[1].split('_')]) 103 | 104 | # print(sub_df) 105 | thumbnail = general_utils.visualize_tile_scoring(thumbnail, 106 | desired_otsu_thumbnail_tile_size, 107 | sub_df.address.tolist(), 108 | sub_df.predicted_class.tolist(), 109 | overlap=int(config.args.overlap//scale_factor), 110 | range_=[0, 3]) 111 | thumbnail = Image.fromarray(thumbnail) 112 | thumbnail = label_image_tissue_type(thumbnail,
map_reverse_key) 113 | thumbnail = add_scale_bar(thumbnail, scale, slide_mag) 114 | thumbnail.save('{}/{}_{}_eval.png'.format(sub_dir, image_path.split('/')[-1], set_name)) 115 | 116 | 117 | if __name__ == '__main__': 118 | all_preds = [] 119 | all_acc = [] 120 | for set_name, _pred_file in zip(['val'], [config.args.val_pred_file]): 121 | for fold in range(config.args.crossval): 122 | pred_file = _pred_file.format(fold) 123 | preds = pd.read_csv(pred_file) 124 | preds = preds.set_index('tile_file_name') 125 | n_classes = 4 126 | 127 | results = {} 128 | 129 | if n_classes == 2: 130 | results.update(get_auprc(preds)) 131 | results.update(get_random_auprc_from_df(preds)) 132 | print(preds) 133 | preds['predicted_class'] = preds.drop(columns='label').idxmax(axis='columns').str.replace( 134 | 'score_', '').astype(int) 135 | preds['certainty'] = preds.drop(columns=['label', 'predicted_class']).max(axis='columns') 136 | 137 | results.update(get_all_single_class_auprc_values(preds)) 138 | results.update(get_accuracy(preds)) 139 | all_acc.append(results['accuracy']) 140 | 141 | map_key = {0: 'Stroma', 1: 'Tumor', 2: 'Fat', 3: 'Necrosis'} 142 | map_reverse_key = dict([(v, k) for k, v in map_key.items()]) 143 | all_preds.append(preds) 144 | cm_analysis(y_true=preds.label, 145 | y_pred=preds.predicted_class, 146 | ymap=map_key, 147 | labels=['Stroma', 'Tumor', 'Fat', 'Necrosis'], 148 | filename='evals/{}'.format(pred_file.split('/')[-1].replace('.csv', '_confusion.png'))) 149 | 150 | with open('evals/{}'.format(pred_file.split('/')[-1].replace('.csv', '.txt')), 'w') as f: 151 | json.dump(results, f) 152 | preds = preds.reset_index() 153 | preds['image_id'] = preds['tile_file_name'].str.split('/').map(lambda x: str(x[0])) 154 | df = pd.read_csv(config.args.preprocessed_cohort_csv_path)[['image_path']] 155 | df['image_id'] = df['image_path'].str.split('/').map(lambda x: str(x[-1][:-4])) 156 | df = df.set_index('image_id') 157 | preds = preds.join(df, on='image_id', how='left').drop(columns=['image_id']) 158 | visualize(preds, pred_file) 159 | 160 | print('{:4.3f} +/- {:4.3f}'.format(np.mean(all_acc), np.std(all_acc))) 161 | all_preds = pd.concat(all_preds, axis=0) 162 | all_preds.to_csv('evals/all_preds_{}'.format(config.args.val_pred_file.split('/')[-1].format('_all'))) 163 | cm_analysis(y_true=all_preds.label, 164 | y_pred=all_preds.predicted_class, 165 | ymap=map_key, 166 | labels=['Stroma', 'Tumor', 'Fat', 'Necrosis'], 167 | filename='evals/integrated.svg') 168 | -------------------------------------------------------------------------------- /tissue-type-training/evals/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/evals/.gitkeep -------------------------------------------------------------------------------- /tissue-type-training/general_utils.py: -------------------------------------------------------------------------------- 1 | import colorsys 2 | import re 3 | 4 | from openslide import OpenSlide, ImageSlide 5 | from openslide.deepzoom import DeepZoomGenerator 6 | from skimage.color import rgb2gray 7 | from skimage.filters import threshold_otsu 8 | from torch import distributed as dist 9 | 10 | import numpy as np 11 | import pandas as pd 12 | import torch 13 | import os 14 | from random import choice 15 | from PIL import Image, ImageDraw, ImageFont 16 | from skimage.draw import rectangle_perimeter, rectangle 17 | from skimage import 
color 18 | from copy import deepcopy 19 | from datetime import datetime 20 | from torchvision import transforms 21 | from sklearn.model_selection import train_test_split, StratifiedKFold, KFold 22 | 23 | 24 | def get_magnification(slide): 25 | return int(slide.properties['aperio.AppMag']) 26 | 27 | 28 | def get_downscaled_thumbnail(slide, scale_factor=32): 29 | new_width = slide.dimensions[0] // scale_factor 30 | new_height = slide.dimensions[1] // scale_factor 31 | img = slide.get_thumbnail((new_width, new_height)) 32 | return np.array(img) 33 | 34 | 35 | def get_full_resolution_generator(slide, tile_size, overlap=0, level_offset=0): 36 | assert isinstance(slide, OpenSlide) or isinstance(slide, ImageSlide) 37 | generator = DeepZoomGenerator(slide, overlap=overlap, tile_size=tile_size, limit_bounds=False) 38 | generator_level = generator.level_count - 1 - level_offset 39 | if level_offset == 0: 40 | assert generator.level_dimensions[generator_level] == slide.dimensions 41 | return generator, generator_level 42 | 43 | 44 | def adjust_scale_for_slide_mag(slide_mag, desired_mag, scale): 45 | if slide_mag != desired_mag: 46 | if slide_mag < desired_mag: 47 | raise AssertionError('expected mag >={} but got {}'.format(desired_mag, slide_mag)) 48 | elif (slide_mag / desired_mag) == 2: 49 | scale *= 2 50 | elif (slide_mag / desired_mag) == 4: 51 | scale *= 4 52 | else: 53 | raise AssertionError('expected mag {} or {} but got {}'.format(desired_mag, 2 * desired_mag, slide_mag)) 54 | return scale 55 | 56 | 57 | def visualize_tiling(_thumbnail, tile_size, tile_addresses, overlap=0): 58 | """ 59 | Draw black boxes around tiles passing threshold 60 | :param _thumbnail: np.ndarray 61 | :param tile_size: int 62 | :param tile_addresses: 63 | :return: new thumbnail image with black boxes around tiles passing threshold 64 | """ 65 | assert isinstance(_thumbnail, np.ndarray) and isinstance(tile_size, int) 66 | thumbnail = deepcopy(_thumbnail) 67 | generator, generator_level = get_full_resolution_generator(array_to_slide(thumbnail), 68 | tile_size=tile_size, 69 | overlap=overlap) 70 | 71 | for address in tile_addresses: 72 | if isinstance(address, list): 73 | address = address[0] 74 | extent = generator.get_tile_dimensions(generator_level, address) 75 | start = (address[1] * tile_size, address[0] * tile_size) # flip because OpenSlide uses 76 | # (column, row), but skimage 77 | # uses (row, column) 78 | rr, cc = rectangle_perimeter(start=start, extent=extent, shape=thumbnail.shape) 79 | thumbnail[rr, cc] = 1 80 | 81 | return thumbnail 82 | 83 | 84 | def colorize(image, hue, saturation=1.0): 85 | """ Add color of the given hue to an RGB image. 86 | 87 | By default, set the saturation to 1 so that the colors pop! 
88 | """ 89 | hsv = color.rgb2hsv(image) 90 | hsv[:, :, 1] = saturation 91 | hsv[:, :, 0] = hue 92 | rgb = (color.hsv2rgb(hsv) * 255).astype(int) 93 | return rgb 94 | 95 | 96 | def visualize_tile_scoring(_thumbnail, tile_size, tile_addresses, tile_scores, overlap=0, range_=[-3, 3]): 97 | """ 98 | Draw black boxes around tiles passing threshold 99 | :param _thumbnail: np.ndarray 100 | :param tile_size: int 101 | :param tile_addresses: 102 | :return: new thumbnail image with black boxes around tiles passing threshold 103 | """ 104 | denom = 2 * (range_[1] - range_[0]) 105 | assert isinstance(_thumbnail, np.ndarray) and isinstance(tile_size, int) 106 | thumbnail = deepcopy(_thumbnail) 107 | generator, generator_level = get_full_resolution_generator(array_to_slide(thumbnail), 108 | tile_size=tile_size, 109 | overlap=overlap) 110 | 111 | for address, score in zip(tile_addresses, tile_scores): 112 | extent = generator.get_tile_dimensions(generator_level, address) 113 | start = (address[1] * tile_size, address[0] * tile_size) # flip because OpenSlide uses 114 | # (column, row), but skimage 115 | # uses (row, column) 116 | 117 | rr, cc = rectangle(start=start, extent=extent, shape=thumbnail.shape) 118 | thumbnail[rr, cc] = colorize(thumbnail[rr, cc], hue=0.5-(score-range_[0])/denom, saturation=0.5) 119 | 120 | return thumbnail 121 | 122 | 123 | def array_to_slide(arr): 124 | assert isinstance(arr, np.ndarray) 125 | slide = ImageSlide(Image.fromarray(arr)) 126 | return slide 127 | 128 | 129 | def get_current_time(): 130 | return str(datetime.now()).replace(' ', '_').split('.')[0].replace(':', '.') 131 | 132 | 133 | def load_preprocessed_df(file_name, min_n_tiles, cols=None, seed=None, explode=True): 134 | if cols: 135 | df_ = pd.read_csv(file_name, low_memory=False, usecols=cols) 136 | else: 137 | df_ = pd.read_csv(file_name, low_memory=False) 138 | 139 | df_ = df_[df_.n_foreground_tiles >= min_n_tiles] 140 | 141 | df_.tile_address = df_.tile_address.map(eval) 142 | if explode: 143 | df_ = df_.explode('tile_address') 144 | df_['tile_file_name'] = df_.tile_address.apply( 145 | lambda x: str(x).split(')')[0].replace('(', '').replace(')', '').replace('[', '').replace(']', '').replace(', ', '_') + '.png') 146 | # lambda x: get_tile_file_name('---', x.img_hid, x.tile_address), 147 | # axis=1).str.split('/').map(lambda x: x[-1]) 148 | 149 | return df_ 150 | 151 | 152 | def k_fold_ptwise_crossval(df, k, seed): 153 | if 'observed' in df.columns: 154 | if 'Patient ID' in df.columns: 155 | temp_df = df.groupby('Patient ID').agg('mean').observed 156 | else: 157 | temp_df = df.groupby('image_path').agg('mean').observed 158 | 159 | patient_ids = np.array(temp_df.index.tolist()) 160 | observed = np.array(temp_df.tolist()) 161 | 162 | kf = StratifiedKFold(n_splits=k, random_state=seed % (2**32 - 1), shuffle=True).split(patient_ids, observed) 163 | else: 164 | if 'Patient ID' in df.columns: 165 | patient_ids = np.sort(df['Patient ID'].unique()) 166 | else: 167 | patient_ids = np.sort(df['image_path'].unique()) 168 | 169 | kf = KFold(n_splits=k, random_state=seed % (2**32 - 1), shuffle=True).split(patient_ids) 170 | 171 | df_list = [] 172 | for train_indices, test_indices in kf: 173 | train_labels = patient_ids[train_indices] 174 | test_labels = patient_ids[test_indices] 175 | # train_labels = list(set(DF.index[train_indices].tolist())) 176 | # test_labels = list(set(DF.index.tolist()) - set(train_labels)) 177 | if 'Patient ID' in df.columns: 178 | train_mask = df['Patient ID'].isin(train_labels) 179 | test_mask = 
df['Patient ID'].isin(test_labels) 180 | else: 181 | train_mask = df['image_path'].isin(train_labels) 182 | test_mask = df['image_path'].isin(test_labels) 183 | 184 | assert test_mask.sum() > 0 185 | assert train_mask.sum() > 0 186 | 187 | DF = df.copy(deep=True) 188 | DF.loc[train_mask, 'split'] = 'train' 189 | DF.loc[test_mask, 'split'] = 'val' 190 | df_list.append(DF) 191 | return df_list 192 | 193 | 194 | def get_slide_dir(_dir, slide_file_name): 195 | slide_stem = slide_file_name[:-4] 196 | return os.path.join(_dir, slide_stem) 197 | 198 | 199 | def get_tile_file_name(_dir, slide_file_name, address): 200 | address_suffix = str(address).replace('(', '').replace(')', '').replace(', ', '_') 201 | file_name = address_suffix + '.png' 202 | return os.path.join(get_slide_dir(_dir, slide_file_name), file_name) 203 | 204 | 205 | def load_ddp_state_dict_to_device(path, device='cpu', ddp_to_serial=True): 206 | assert isinstance(device, str) 207 | ddp_state_dict = torch.load(path, map_location=device) 208 | if ddp_to_serial: 209 | state_dict = {} 210 | for key, value in ddp_state_dict.items(): 211 | state_dict[key.replace('module.', '')] = value 212 | return state_dict 213 | else: 214 | return ddp_state_dict 215 | 216 | 217 | def setup(rank, world_size): 218 | # initialize the process group 219 | os.environ['MASTER_ADDR'] = 'localhost' 220 | try: 221 | os.environ['MASTER_PORT'] = '12355' 222 | dist.init_process_group("nccl", rank=rank, world_size=world_size) 223 | except RuntimeError: 224 | os.environ['MASTER_PORT'] = '1234' 225 | dist.init_process_group("nccl", rank=rank, world_size=world_size) 226 | 227 | print('device {} initialized'.format(rank)) 228 | 229 | 230 | def cleanup(): 231 | dist.destroy_process_group() 232 | 233 | 234 | def get_starting_timestamp(config): 235 | if config.args.checkpoint_path: 236 | short_path = config.args.checkpoint_path.split('/')[-1] 237 | starting_timestamp_ = re.search('^.*(?=(\_epoch))', short_path).group(0) 238 | if '_fold' in starting_timestamp_: 239 | starting_timestamp_ = starting_timestamp_.split('_fold')[0] 240 | else: 241 | starting_timestamp_ = get_current_time() 242 | return starting_timestamp_ 243 | 244 | 245 | def load_model_state_dict(model, checkpoint_path, device='cpu'): 246 | assert isinstance(device, str) 247 | model.load_state_dict(load_ddp_state_dict_to_device( 248 | os.path.join('checkpoints', checkpoint_path), device=device)) 249 | 250 | 251 | def get_starting_epoch(config): 252 | if config.args.checkpoint_path: 253 | starting_epoch = int(re.search('epoch(\d+)', config.args.checkpoint_path).group(1)) 254 | else: 255 | starting_epoch = 1 256 | return starting_epoch 257 | 258 | 259 | def log_results_string(epoch, starting_epoch, train_loss, val_loss, starting_timestamp, fold, other_keys=dict()): 260 | epoch_str = get_epoch_str(epoch, starting_epoch) 261 | results_str = 'epoch {}: train loss {:.3e} | val loss {:.3e}'.format( 262 | epoch_str, train_loss, val_loss) 263 | if other_keys: 264 | other_keys = list(other_keys.items()) 265 | other_keys.sort(key=lambda x: x[0]) 266 | for key, val in other_keys: 267 | results_str += ' | {} {:.3e}'.format(key, val) 268 | print(results_str) 269 | with open('checkpoints/{}_fold{}_log.txt'.format(starting_timestamp, fold), 'a+') as file: 270 | file.write(results_str + '\n') 271 | 272 | 273 | def get_train_transforms(smaller_dim=None, normalize=True): 274 | l = [] 275 | if smaller_dim: 276 | l.append(transforms.RandomCrop(size=smaller_dim)) 277 | 278 | l.extend([ 279 | transforms.RandomHorizontalFlip(0.5), 
280 | transforms.RandomVerticalFlip(0.5), 281 | transforms.ColorJitter(brightness=0.1, 282 | contrast=0.1, 283 | saturation=0.05, 284 | hue=0.01), 285 | transforms.ToTensor()] 286 | ) 287 | if normalize: 288 | l.append(transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))) 289 | return transforms.Compose(l) 290 | 291 | 292 | def get_val_transforms(smaller_dim=None, normalize=True): 293 | l = [] 294 | if smaller_dim: 295 | l.append(transforms.RandomCrop(size=smaller_dim)) 296 | l.append(transforms.ToTensor()) 297 | if normalize: 298 | l.append(transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))) 299 | return transforms.Compose(l) 300 | 301 | 302 | def get_epoch_str(epoch, starting_epoch): 303 | return str(epoch + 1 + starting_epoch).zfill(3) 304 | 305 | 306 | def get_checkpoint_path(starting_timestamp, epoch, starting_epoch, fold): 307 | epoch_str = get_epoch_str(epoch, starting_epoch) 308 | return 'checkpoints/{}_fold{}_epoch{}.torch'.format(starting_timestamp, fold, epoch_str) 309 | 310 | 311 | def make_otsu(img, scale=1): 312 | """ 313 | Make image with pixel-wise foreground/background labels. 314 | :param img: grayscale np.ndarray 315 | :return: np.ndarray where each pixel is 0 if background and 1 if foreground 316 | """ 317 | assert isinstance(img, np.ndarray) 318 | _img = rgb2gray(img) 319 | threshold = threshold_otsu(_img) 320 | return (_img < (threshold * scale)).astype(float) 321 | 322 | 323 | def label_image_tissue_type(thumbnail, map_key): 324 | """ 325 | Labels tissue with hue overlay based on predicted classes. 326 | """ 327 | vals = list(map_key.values()) 328 | colors = [] 329 | range_ = [np.min(vals), np.max(vals)] 330 | denom = 2 * (range_[1] - range_[0]) 331 | for class_, score in map_key.items(): 332 | colors.append(tuple([int(255 * x) for x in 333 | colorsys.hsv_to_rgb(0.5 - (score - range_[0]) / denom, 0.5, 1.0)])) 334 | d = ImageDraw.Draw(thumbnail) 335 | 336 | text_locations = [(10, 10+40*x) for x in range(len(map_key))] 337 | for (class_, score), text_location in zip(map_key.items(), text_locations): 338 | d.text(text_location, class_, fill=colors[score]) 339 | return thumbnail 340 | 341 | 342 | def get_fold_slides(df, world_size, rank): 343 | all_slides = df.image_id.unique() 344 | chunks = np.array_split(all_slides, world_size) 345 | return chunks[rank] 346 | 347 | 348 | def add_scale_bar(thumbnail, scale, slide_mag, len_in_um=1000): 349 | if slide_mag == 20: 350 | um_per_pix = 0.5 351 | elif slide_mag == 40: 352 | um_per_pix = 0.25 353 | else: 354 | raise RuntimeError("Unhandled slide mag {}x".format(slide_mag)) 355 | 356 | um_per_pix *= scale 357 | 358 | len_in_pixels = len_in_um / float(um_per_pix) 359 | 360 | endpoints = [(10, 10+40*6), (10+len_in_pixels, 10+40*6)] 361 | d = ImageDraw.Draw(thumbnail) 362 | d.line(endpoints, fill='black', width=5) 363 | return thumbnail 364 | -------------------------------------------------------------------------------- /tissue-type-training/models.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torchvision.models 4 | from torchvision.models import resnet18, resnet34, resnet50, squeezenet1_1, vgg19_bn 5 | 6 | 7 | class TissueTileNet(nn.Module): 8 | def __init__(self, model, n_classes, activation=None): 9 | super(TissueTileNet, self).__init__() 10 | if type(model) in [torchvision.models.resnet.ResNet]: 11 | model.fc = nn.Linear(512, n_classes) 12 | elif type(model) == torchvision.models.squeezenet.SqueezeNet: 13 
-------------------------------------------------------------------------------- /tissue-type-training/models.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torchvision.models 4 | from torchvision.models import resnet18, resnet34, resnet50, squeezenet1_1, vgg19_bn 5 | 6 | 7 | class TissueTileNet(nn.Module): 8 | def __init__(self, model, n_classes, activation=None): 9 | super(TissueTileNet, self).__init__() 10 | if type(model) in [torchvision.models.resnet.ResNet]: 11 | model.fc = nn.Linear(512, n_classes) 12 | elif type(model) == torchvision.models.squeezenet.SqueezeNet: 13 | list(model.children())[1][1] = nn.Conv2d(512, n_classes, kernel_size=1, stride=1) 14 | else: 15 | raise NotImplementedError 16 | self.model = model 17 | self.activation = activation 18 | 19 | def forward(self, x): 20 | y = self.model(x) 21 | if self.activation: 22 | y = self.activation(y) 23 | 24 | return y 25 | 26 | 27 | def get_model(cf): 28 | if cf.args.model == 'resnet18': 29 | return resnet18(pretrained=True) 30 | elif cf.args.model == 'resnet34': 31 | return resnet34(pretrained=True) 32 | elif cf.args.model == 'resnet50': 33 | return resnet50(pretrained=True) 34 | elif cf.args.model == 'squeezenet': 35 | return squeezenet1_1(pretrained=True) 36 | elif cf.args.model == 'vgg19': 37 | return vgg19_bn(pretrained=True) 38 | elif cf.args.model == 'tissue-type': 39 | model = load_tissue_tile_net() 40 | model = model.model 41 | for idx, child in enumerate(model.children()): 42 | if idx in [0, 1, 2, 3, 4, 5, 6, 7]: # 7 is last res block, 9 is fc layer 43 | for param in child.parameters(): 44 | param.requires_grad = False 45 | return model 46 | else: 47 | raise RuntimeError("Model type {} unknown".format(cf.args.model)) 48 | 49 | def load_tissue_tile_net(checkpoint_path='', activation=None, n_classes=4): 50 | model = TissueTileNet(resnet18(), n_classes, activation=activation) 51 | model.load_state_dict(torch.load( 52 | checkpoint_path, 53 | map_location='cpu')) 54 | return model 55 | -------------------------------------------------------------------------------- /tissue-type-training/pred_tissue_tile.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import os 3 | import re 4 | 5 | from torch.utils.data import DataLoader 6 | 7 | import config 8 | import general_utils 9 | from dataset import TissueTileDataset 10 | from train_tissue_tile_clf import make_preds, prep_df 11 | from models import TissueTileNet, get_model 12 | 13 | if __name__ == '__main__': 14 | assert config.args.checkpoint_path 15 | assert len(config.args.gpu) == 1 16 | 17 | device_str = 'cuda:{}'.format(config.args.gpu[0]) 18 | device = torch.device(device_str) 19 | 20 | num_workers = 8 21 | transforms = general_utils.get_val_transforms() 22 | seed = 1123011750 23 | 24 | df_, n_classes, map_key, map_reverse_key = prep_df(config.args.preprocessed_cohort_csv_path, 25 | tile_dir=config.args.tile_dir) 26 | fold = int(config.args.checkpoint_path.split('fold')[1][0]) 27 | df = general_utils.k_fold_ptwise_crossval(df_, config.args.crossval, seed)[fold] 28 | 29 | model = TissueTileNet(model=get_model(config), 30 | n_classes=n_classes, 31 | activation=torch.nn.Softmax(dim=1)) 32 | model.load_state_dict(general_utils.load_ddp_state_dict_to_device(config.args.checkpoint_path)) 33 | model.to(device) 34 | 35 | 36 | print('making val preds for fold {}'.format(fold)) 37 | val_dataset = TissueTileDataset(df=df[df.split == 'val'], 38 | tile_dir=config.args.tile_dir, 39 | transforms=transforms) 40 | assert len(val_dataset) > 0 41 | val_loader = DataLoader(val_dataset, 42 | batch_size=config.args.batch_size, 43 | num_workers=num_workers) 44 | make_preds(model, 45 | val_loader, 46 | device, 47 | config.args.val_pred_file, 48 | n_classes) 49 | -------------------------------------------------------------------------------- /tissue-type-training/predictions/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/predictions/.gitkeep
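`load_tissue_tile_net` pairs a plain `resnet18` with the four-class head used throughout this stage (Stroma, Tumor, Fat, and Necrosis, per `prep_df` in `train_tissue_tile_clf.py` below). A minimal sketch of restoring the released weights for CPU inference, assuming the checkpoint shipped in this repository at `checkpoints/tissue_type_classifier_weights.torch`:

```python
import torch

from models import load_tissue_tile_net

model = load_tissue_tile_net(
    checkpoint_path='checkpoints/tissue_type_classifier_weights.torch',
    activation=torch.nn.Softmax(dim=1),
    n_classes=4)
model.eval()

# random stand-in for a batch of two normalized RGB tiles
dummy_batch = torch.randn(2, 3, 128, 128)
with torch.no_grad():
    scores = model(dummy_batch)  # shape (2, 4); each row sums to ~1 after softmax
```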
-------------------------------------------------------------------------------- /tissue-type-training/preprocess.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import yaml 4 | import pandas as pd 5 | import numpy as np 6 | 7 | from openslide import OpenSlide 8 | from openslide.lowlevel import OpenSlideUnsupportedFormatError 9 | from shapely.geometry import Polygon, Point 10 | from PIL import Image 11 | from joblib import Parallel, delayed 12 | from skimage.color import rgb2lab 13 | 14 | import config 15 | import general_utils 16 | import itertools 17 | 18 | from general_utils import make_otsu 19 | 20 | 21 | def percent_otsu_score(tile): 22 | """ 23 | Get percent foreground score. 24 | :param tile: PIL.Image (greyscale) 25 | :return: float [0,1] of percent foreground in tile 26 | """ 27 | assert isinstance(tile, Image.Image) 28 | arr = np.array(tile) 29 | return np.mean(arr) 30 | 31 | 32 | def purple_score(tile_): 33 | """ 34 | Get percent purple score. 35 | :param tile_: PIL.Image (RGB) 36 | :return: float of the fraction of purple pixels in the tile, normalized by the full array size (H*W*C), so the maximum is ~1/3 37 | """ 38 | assert isinstance(tile_, Image.Image) 39 | tile = np.array(tile_) 40 | r, g, b = tile[..., 0], tile[..., 1], tile[..., 2] 41 | score = np.sum((r > (g + 10)) & (b > (g + 10))) 42 | return score / tile.size 43 | 44 | 45 | def score_tiles(otsu_img, rgb_img, tile_size): 46 | """ 47 | Get scores for tiles based on percent foreground. When tile_size and img size are downscaled 48 | proportionally, these coordinates map directly into the slide with proportionately upscaled 49 | tile_size and img size. 50 | :param otsu_img: np.ndarray, possibly downsampled. binary thresholded 51 | :param rgb_img: np.ndarray, possibly downsampled. RGB.
same size as otsu_img 52 | :param tile_size: side length 53 | :return: list of (int_x, int_y) tuples 54 | """ 55 | assert isinstance(otsu_img, np.ndarray) and isinstance(tile_size, int) 56 | otsu_slide = general_utils.array_to_slide(otsu_img) 57 | otsu_generator, otsu_generator_level = general_utils.get_full_resolution_generator(otsu_slide, 58 | tile_size=tile_size) 59 | rgb_slide = general_utils.array_to_slide(rgb_img) 60 | rgb_generator, rgb_generator_level = general_utils.get_full_resolution_generator(rgb_slide, 61 | tile_size=tile_size) 62 | 63 | tile_x_count, tile_y_count = otsu_generator.level_tiles[otsu_generator_level] 64 | address_list = [] 65 | for count, address in enumerate(itertools.product(range(tile_x_count), range(tile_y_count))): 66 | if count > 1000 and len(address_list) > 10 and PROTOTYPE: 67 | break 68 | dimensions = otsu_generator.get_tile_dimensions(otsu_generator_level, address) 69 | if not (dimensions[0] == tile_size) or not (dimensions[1] == tile_size): 70 | continue 71 | 72 | rgb_tile = rgb_generator.get_tile(rgb_generator_level, address) 73 | if not purple_score(rgb_tile) > config.args.purple_threshold: 74 | continue 75 | 76 | otsu_tile = otsu_generator.get_tile(otsu_generator_level, address) 77 | otsu_score = percent_otsu_score(otsu_tile) 78 | 79 | if otsu_score < config.args.otsu_threshold: 80 | continue 81 | 82 | address_list.append(address) 83 | 84 | return address_list 85 | 86 | 87 | def score_tiles_manual(ann_file_name, otsu_img, thumbnail, tile_size, overlap): 88 | assert isinstance(thumbnail, np.ndarray) and isinstance(tile_size, int) 89 | 90 | try: 91 | with open(ann_file_name, 'r') as f: 92 | slide_annotations = json.load(f)['features'] 93 | except FileNotFoundError: 94 | print('Warning: {} not found'.format(ann_file_name)) 95 | return [] 96 | 97 | slide = general_utils.array_to_slide(thumbnail) 98 | generator, generator_level = general_utils.get_full_resolution_generator(slide, 99 | tile_size=tile_size, 100 | overlap=overlap) 101 | otsu_slide = general_utils.array_to_slide(otsu_img) 102 | otsu_generator, otsu_generator_level = general_utils.get_full_resolution_generator(otsu_slide, 103 | tile_size=tile_size, 104 | overlap=overlap) 105 | tile_x_count, tile_y_count = generator.level_tiles[generator_level] 106 | print('{}, {}'.format(tile_x_count, tile_y_count)) 107 | address_list = [] 108 | for count, address in enumerate(itertools.product(range(tile_x_count), range(tile_y_count))): 109 | if count > 1000 and len(address_list) > 10 and PROTOTYPE: 110 | break 111 | dimensions = generator.get_tile_dimensions(generator_level, address) 112 | assert isinstance(tile_size, int) 113 | if dimensions[0] != (tile_size + 2*overlap) or dimensions[1] != (tile_size + 2*overlap): 114 | continue 115 | 116 | tile_location, _level_, new_tile_size = generator.get_tile_coordinates(generator_level, address) 117 | assert _level_ == 0 118 | tile_class = is_tile_in_annotations(tile_location, new_tile_size, slide_annotations) 119 | if not tile_class: 120 | continue 121 | else: 122 | if int(tile_class) in [3, 4, 6, 7]: 123 | otsu_tile = otsu_generator.get_tile(otsu_generator_level, address) 124 | otsu_score = percent_otsu_score(otsu_tile) 125 | 126 | if otsu_score < config.args.otsu_threshold: 127 | continue 128 | address_list.append([address, tile_class]) 129 | return address_list 130 | 131 | 132 | def is_tile_in_annotations(tile_location, tile_size, slide_annotations): 133 | """ 134 | Determine whether tile is in annotations, and if so, what class of annotation. 
135 | :param tile_location: (x, y) of the tile's top-left corner 136 | :param tile_size: (width, height) of the tile in pixels 137 | :param slide_annotations: list of GeoJSON-style features with 'properties' and 'geometry' keys 138 | :return: 0 if the tile is not in any annotation, else the label_num of the first annotated region containing all four tile corners 139 | """ 140 | points = [Point(tile_location[0], tile_location[1]), 141 | Point(tile_location[0] + tile_size[0], tile_location[1]), 142 | Point(tile_location[0], tile_location[1] + tile_size[1]), 143 | Point(tile_location[0] + tile_size[0], tile_location[1] + tile_size[1])] 144 | for annotation_ in slide_annotations: 145 | point_count = 0 146 | class_ = annotation_['properties']['label_num'] 147 | assert annotation_['geometry']['type'] == 'Polygon' 148 | 149 | coords = annotation_['geometry']['coordinates'] 150 | if len(coords) >= 3: 151 | annotation = Polygon(coords) 152 | for point in points: 153 | if annotation.contains(point): 154 | point_count += 1 155 | if point_count > 3: 156 | return class_ 157 | return 0 158 | 159 | 160 | def get_slide_tile_addresses(image_path, mag, scale, desired_tile_selection_size, index, tile_selection, overlap, annotation_path=None, visualize=False): 161 | assert isinstance(overlap, int) 162 | 163 | slide = OpenSlide(image_path) 164 | 165 | slide_mag = general_utils.get_magnification(slide) 166 | scale = general_utils.adjust_scale_for_slide_mag(slide_mag=slide_mag, desired_mag=mag, scale=scale) 167 | thumbnail = general_utils.get_downscaled_thumbnail(slide, scale) 168 | overlap = int(overlap // scale) 169 | otsu_thumbnail = make_otsu(thumbnail) 170 | if tile_selection == 'otsu': 171 | assert config.args.overlap == 0 172 | tile_addresses = score_tiles(otsu_thumbnail, 173 | thumbnail, 174 | tile_size=desired_tile_selection_size) 175 | tile_addresses = [[x, -1] for x in tile_addresses] # for consistent formatting with manual 176 | elif tile_selection == 'manual': 177 | assert annotation_path is not None 178 | tile_addresses = score_tiles_manual(annotation_path, 179 | otsu_thumbnail, 180 | thumbnail, 181 | tile_size=desired_tile_selection_size, 182 | overlap=overlap) 183 | else: 184 | raise RuntimeError("Unknown tile_selection type {}".format(tile_selection)) 185 | 186 | if visualize: 187 | thumbnail = general_utils.visualize_tiling(thumbnail, 188 | desired_tile_selection_size, 189 | tile_addresses, 190 | overlap=overlap) 191 | thumbnail = Image.fromarray(thumbnail) 192 | thumbnail.save('tiling_visualizations/{}_tiling.png'.format(image_path.split('/')[-1])) 193 | return {index: tile_addresses} 194 | 195 | 196 | if __name__ == '__main__': 197 | PROTOTYPE = False 198 | visualize = False 199 | serial = False 200 | 201 | scale_factor = config.args.tile_size / config.desired_otsu_thumbnail_tile_size 202 | 203 | df = pd.read_csv(config.args.cohort_csv_path) 204 | with open('../global_config.yaml', 'r') as f: 205 | DIRECTORIES = yaml.safe_load(f) 206 | DATA_DIR = DIRECTORIES['data_dir'] 207 | df['image_path'] = df['image_path'].apply(lambda x: os.path.join(DATA_DIR, x)) 208 | if 'segmentation_path' in df.columns: 209 | df['segmentation_path'] = df['segmentation_path'].apply(lambda x: os.path.join(DATA_DIR, x)) 210 | 211 | if PROTOTYPE: 212 | df = df.head(4) 213 | 214 | coords = {} 215 | if serial: 216 | for index, row in df.iterrows(): 217 | print(row['image_path']) 218 | tile_addresses = get_slide_tile_addresses(row['image_path'], 219 | mag=config.args.magnification, 220 | scale=scale_factor, 221 | desired_tile_selection_size= 222 | config.desired_otsu_thumbnail_tile_size, 223 | index=index, 224 | tile_selection=config.args.tile_selection_type, 225 | overlap=config.args.overlap, 226 | visualize=visualize, 227 |
annotation_path=row['segmentation_path'] if 'segmentation_path' in df.columns else None) 228 | coords.update(tile_addresses) 229 | else: 230 | _dicts = Parallel(n_jobs=64)(delayed(get_slide_tile_addresses)(row['image_path'], 231 | mag=config.args.magnification, 232 | scale=scale_factor, 233 | desired_tile_selection_size= 234 | config.desired_otsu_thumbnail_tile_size, 235 | index=index, 236 | tile_selection=config.args.tile_selection_type, 237 | overlap=config.args.overlap, 238 | visualize=visualize, 239 | annotation_path=row['segmentation_path'] if 'segmentation_path' in df.columns else None) 240 | for index, row in df.iterrows()) 241 | for _dict in _dicts: 242 | coords.update(_dict) 243 | 244 | df['tile_address'] = pd.Series(coords) 245 | df['n_foreground_tiles'] = df['tile_address'].map(len) 246 | 247 | df.to_csv(config.args.preprocessed_cohort_csv_path, index=False) 248 | -------------------------------------------------------------------------------- /tissue-type-training/pretile.py: -------------------------------------------------------------------------------- 1 | import general_utils 2 | import config 3 | import os 4 | import numpy as np 5 | import yaml 6 | 7 | from PIL import Image 8 | from openslide import OpenSlide 9 | from openslide.lowlevel import OpenSlideUnsupportedFormatError 10 | from openslide.deepzoom import DeepZoomGenerator 11 | from joblib import delayed, Parallel 12 | from skimage.color import rgb2lab, lab2rgb 13 | 14 | from general_utils import make_otsu 15 | 16 | 17 | # normalization tools from https://github.com/CODAIT/deep-histopath/blob/master/deephistopath/preprocessing.py 18 | stain_ref = np.array([[0.56237296, 0.38036293], 19 | [0.72830425, 0.83254214], 20 | [0.39154767, 0.40273766]]) 21 | max_sat_ref = np.array([[0.62245465], 22 | [0.44427557]]) 23 | 24 | beta = 0.15 25 | alpha = 1 26 | light_intensity = 255 27 | 28 | 29 | # credit: StainTools (https://github.com/Peter554/StainTools/blob/master/staintools/preprocessing/luminosity_standardizer.py) 30 | def get_standard_luminosity_limit(rgb): 31 | assert isinstance(rgb, np.ndarray) 32 | lab = rgb2lab(rgb) 33 | p = np.percentile(lab[:, :, 0], 95) 34 | return p 35 | 36 | 37 | # credit: StainTools (https://github.com/Peter554/StainTools/blob/master/staintools/preprocessing/luminosity_standardizer.py) 38 | def apply_standard_luminosity_limit(rgb, p): 39 | assert isinstance(rgb, np.ndarray) 40 | lab = rgb2lab(rgb) 41 | lab[:, :, 0] = np.clip(100 * lab[:, :, 0] / p, 0, 100) 42 | return np.round(np.clip(255 * lab2rgb(lab), 0, 255)).astype(np.uint8) 43 | 44 | # credit: StainTools (https://github.com/Peter554/StainTools/blob/master/staintools/preprocessing/luminosity_standardizer.py) 45 | def calculate_macenko_transform(to_transform): 46 | assert isinstance(to_transform, np.ndarray) 47 | 48 | c = to_transform.shape[2] 49 | assert c == 3 50 | 51 | luminosity_limit = get_standard_luminosity_limit(to_transform) 52 | to_transform = apply_standard_luminosity_limit(to_transform, luminosity_limit) 53 | 54 | im = rgb2lab(to_transform) 55 | ignore_mask = ((im[:, :,0] < 50) & (np.abs(im[:,:,1] - im[:,:,2]) < 30)) | (im[:,:,2] < -40) | (im[:,:,0] > 90) | (im[:,:,1] > 40) 56 | to_transform = to_transform[~ignore_mask, :] 57 | 58 | to_transform = to_transform.reshape(-1, c).astype(np.float64) # shape (H*W, C) 59 | 60 | # Convert RGB to OD. 61 | OD = -np.log10(to_transform/light_intensity + 1e-8) 62 | 63 | # Remove data with OD intensity less than beta. 
64 | OD_thresh = OD[np.all(OD >= beta, 1), :] 65 | 66 | # Calculate eigenvectors. 67 | U, s, V = np.linalg.svd(OD_thresh, full_matrices=False) 68 | 69 | # Extract two largest eigenvectors. 70 | top_eigvecs = V[0:2, :].T * -1 # shape (C, 2) 71 | 72 | # Project thresholded optical density values onto plane spanned by 73 | # 2 largest eigenvectors. 74 | proj = np.dot(OD_thresh, top_eigvecs) # shape (K, 2) 75 | 76 | # Calculate angle of each point wrt the first plane direction. 77 | # Note: the parameters are `np.arctan2(y, x)` 78 | angles = np.arctan2(proj[:, 1], proj[:, 0]) # shape (K,) 79 | 80 | # Find robust extremes (a and 100-a percentiles) of the angle. 81 | min_angle = np.percentile(angles, alpha) 82 | max_angle = np.percentile(angles, 100 - alpha) 83 | 84 | # Convert min/max vectors (extremes) back to optimal stains in OD space. 85 | # This computes a set of axes for each angle onto which we can project 86 | # the top eigenvectors. This assumes that the projected values have 87 | # been normalized to unit length. 88 | extreme_angles = np.array( 89 | [[np.cos(min_angle), np.cos(max_angle)], 90 | [np.sin(min_angle), np.sin(max_angle)]] 91 | ) # shape (2,2) 92 | stains = np.dot(top_eigvecs, extreme_angles) # shape (C, 2) 93 | 94 | # Merge vectors with hematoxylin first, and eosin second, as a heuristic. 95 | if stains[0, 0] < stains[0, 1]: 96 | stains[:, [0, 1]] = stains[:, [1, 0]] # swap columns 97 | 98 | # Calculate saturations of each stain. 99 | # Note: Here, we solve 100 | # OD = VS 101 | # S = V^{-1}OD 102 | # where `OD` is the matrix of optical density values of our image, 103 | # `V` is the matrix of stain vectors, and `S` is the matrix of stain 104 | # saturations. Since this is an overdetermined system, we use the 105 | # least squares solver, rather than a direct solve. 106 | sats, _, _, _ = np.linalg.lstsq(stains, OD.T, rcond=None) 107 | 108 | # Normalize stain saturations to have same pseudo-maximum based on 109 | # a reference max saturation. 110 | max_sat = np.percentile(sats, 99, axis=1, keepdims=True) 111 | return stains, max_sat, luminosity_limit 112 | 113 | 114 | # credit: StainTools (https://github.com/Peter554/StainTools/blob/master/staintools/preprocessing/luminosity_standardizer.py) 115 | def apply_macenko_transform(stains, max_sat, luminosity_limit, to_transform): 116 | assert isinstance(to_transform, np.ndarray) 117 | 118 | h, w, c = to_transform.shape 119 | assert c == 3 120 | 121 | to_transform = apply_standard_luminosity_limit(to_transform, luminosity_limit) 122 | 123 | to_transform = to_transform.reshape(-1, c).astype(np.float64) # shape (H*W, C) 124 | 125 | # Convert RGB to OD. 126 | OD = -np.log10(to_transform/light_intensity + 1e-8) 127 | 128 | # Calculate saturations of each stain. 129 | # Note: Here, we solve 130 | # OD = VS 131 | # S = V^{-1}OD 132 | # where `OD` is the matrix of optical density values of our image, 133 | # `V` is the matrix of stain vectors, and `S` is the matrix of stain 134 | # saturations. Since this is an overdetermined system, we use the 135 | # least squares solver, rather than a direct solve. 136 | sats, _, _, _ = np.linalg.lstsq(stains, OD.T, rcond=None) 137 | 138 | # Normalize stain saturations to have same pseudo-maximum based on 139 | # a reference max saturation. 140 | sats = sats / max_sat * max_sat_ref 141 | 142 | # Compute optimal OD values. 143 | OD_norm = np.dot(stain_ref, sats) 144 | 145 | # Recreate image. 
146 | # Note: If the image is immediately converted to uint8 with `.astype(np.uint8)`, it will 147 | # not return the correct values due to the initial values being outside of [0,255]. 148 | # To fix this, we round to the nearest integer, and then clip to [0,255], which is the 149 | # same behavior as Matlab. 150 | # x_norm = np.exp(-OD_norm) * light_intensity # natural log approach 151 | x_norm = 10 ** (-OD_norm) * light_intensity - 1e-8 # log10 approach 152 | x_norm = np.clip(np.round(x_norm), 0, 255).astype(np.uint8) 153 | x_norm = x_norm.T.reshape(h, w, c) 154 | return x_norm 155 | 156 | 157 | 158 | def pretile_slide(row, tile_size, tile_dir, normalize=False, overlap=0): 159 | slide_dir = os.path.join(tile_dir, row['image_path'].split('/')[-1].replace('.svs', '')) 160 | if os.path.exists(slide_dir): 161 | n_tiles_saved = len(os.listdir(slide_dir)) 162 | if n_tiles_saved == row.n_foreground_tiles: 163 | print("{} fully tiled; skipping".format(slide_dir)) 164 | return 165 | else: 166 | os.mkdir(slide_dir) 167 | 168 | slide = OpenSlide(row['image_path']) 169 | 170 | slide_mag = general_utils.get_magnification(slide) 171 | if slide_mag == config.args.magnification: 172 | level_offset = 0 173 | elif slide_mag == 2 * config.args.magnification: 174 | level_offset = 1 175 | elif slide_mag == 4 * config.args.magnification: 176 | level_offset = 2 177 | else: 178 | raise NotImplementedError 179 | 180 | if normalize: 181 | size0, size1 = slide.dimensions 182 | stains, max_sat, luminosity_limit = calculate_macenko_transform( 183 | np.array(slide.get_thumbnail((size0//16, size1//16)))) 184 | else: 185 | stains, max_sat, luminosity_limit = None, None, None 186 | 187 | addresses = row.tile_address 188 | generator, level = general_utils.get_full_resolution_generator(slide, tile_size=tile_size, 189 | level_offset=level_offset, 190 | overlap=overlap) 191 | for address, class_ in addresses: 192 | tile_file_name = os.path.join(slide_dir, 193 | str(address).replace('(', '').replace(')', '').replace(', ', '_') + '.png') 194 | if os.path.exists(tile_file_name): 195 | continue 196 | tile = generator.get_tile(level, address) 197 | 198 | if normalize: 199 | tile = apply_macenko_transform(stains, max_sat, luminosity_limit, np.array(tile)) 200 | tile = Image.fromarray(tile) 201 | tile.save(tile_file_name) 202 | 203 | 204 | if __name__ == '__main__': 205 | serial = True 206 | df = general_utils.load_preprocessed_df(file_name=config.args.preprocessed_cohort_csv_path, 207 | min_n_tiles=1, 208 | explode=False) 209 | 210 | with open('../global_config.yaml', 'r') as f: 211 | DIRECTORIES = yaml.safe_load(f) 212 | DATA_DIR = DIRECTORIES['data_dir'] 213 | df['image_path'] = df['image_path'].apply(lambda x: os.path.join(DATA_DIR, x)) 214 | 215 | if serial: 216 | for _, row in df.iterrows(): 217 | pretile_slide(row=row, 218 | tile_size=config.args.tile_size, 219 | tile_dir=config.args.tile_dir, 220 | normalize=config.args.normalize, 221 | overlap=config.args.overlap) 222 | else: 223 | Parallel(n_jobs=64)(delayed(pretile_slide)(row=row, 224 | tile_size=config.args.tile_size, 225 | tile_dir=config.args.tile_dir, 226 | normalize=config.args.normalize, 227 | overlap=config.args.overlap) 228 | for _, row in df.iterrows()) 229 | -------------------------------------------------------------------------------- /tissue-type-training/pretilings/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/pretilings/.gitkeep
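`pretile.py` above fits the Macenko stain model once per slide on a downsampled thumbnail and then reapplies it to every tile, as `pretile_slide` does: pixel intensities are converted to optical density, `OD = -log10(I / I_0)`, which is factored as `OD = V S` for a stain matrix `V` and saturation matrix `S`; `V` and the 99th-percentile saturations are fit on the thumbnail and reused per tile. A minimal sketch of that two-step flow, with random arrays standing in for real H&E pixels (use genuine slide data in practice; the conda environment for this stage is assumed):

```python
import numpy as np

from pretile import calculate_macenko_transform, apply_macenko_transform

# stand-in for np.array(slide.get_thumbnail(...))
thumbnail = np.random.randint(120, 255, size=(512, 512, 3), dtype=np.uint8)
stains, max_sat, luminosity_limit = calculate_macenko_transform(thumbnail)

# stand-in for a tile returned by the DeepZoom generator
tile = np.random.randint(120, 255, size=(128, 128, 3), dtype=np.uint8)
normalized = apply_macenko_transform(stains, max_sat, luminosity_limit, tile)
print(normalized.dtype, normalized.shape)  # uint8 (128, 128, 3)
```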
-------------------------------------------------------------------------------- /tissue-type-training/train_on_all_annotations.sh: -------------------------------------------------------------------------------- 1 | ARGS="--magnification 20 2 | --tile_dir pretilings 3 | --cohort_csv_path data/dataframes/tissuetype_hne_df.csv 4 | --preprocessed_cohort_csv_path data/dataframes/preprocessed_tissuetype_hne_df.csv 5 | --tile_selection_type manual 6 | --otsu_threshold 0.2 7 | --batch_size 96 8 | --min_n_tiles 1 9 | --num_epochs 22 10 | --crossval 0 11 | --gpu 0 12 | --overlap 32 13 | --tile_size 64 14 | --normalize 15 | --model resnet18 16 | --experiment_name fulldata 17 | --learning_rate 0.0005" 18 | 19 | python train_tissue_tile_clf.py ${ARGS} 20 | -------------------------------------------------------------------------------- /tissue-type-training/train_tissue_tile_clf.py: -------------------------------------------------------------------------------- 1 | from torch.optim import Adam 2 | from torch.utils.data import DataLoader 3 | from sklearn.metrics import f1_score, accuracy_score, confusion_matrix 4 | from sklearn.utils import compute_class_weight 5 | 6 | import numpy as np 7 | import csv 8 | import pickle 9 | import config 10 | import general_utils 11 | import torch 12 | import re 13 | import pandas as pd 14 | import os 15 | 16 | from dataset import TissueTileDataset 17 | from general_utils import get_checkpoint_path, get_train_transforms, get_val_transforms, \ 18 | log_results_string 19 | from models import TissueTileNet, get_model 20 | 21 | 22 | def train_epoch(model, train_loader, optimizer, device, criterion): 23 | model.train() 24 | total_loss = 0 25 | y_pred = [] 26 | y_true = [] 27 | n = len(train_loader.dataset) 28 | for idx, tiles, labels in train_loader: 29 | # zero gradients, then forward pass 30 | optimizer.zero_grad() 31 | output = model(tiles.to(device)) 32 | 33 | # compute loss and backpropagate 34 | loss = criterion(input=output, target=labels.long().to(device)) 35 | loss.backward() 36 | 37 | # update parameters 38 | optimizer.step() 39 | total_loss += loss.detach().cpu().item() 40 | 41 | # keep track of true and predicted classes 42 | y_pred.extend(output.argmax(1).detach().cpu().reshape(-1).numpy()) 43 | y_true.extend(labels.reshape(-1).numpy()) 44 | 45 | total_loss /= n 46 | acc = accuracy_score(y_true=y_true, y_pred=y_pred) 47 | f1 = f1_score(y_true=y_true, y_pred=y_pred, average='macro') 48 | confusion = confusion_matrix(y_true=y_true, y_pred=y_pred) 49 | return total_loss, acc, f1, confusion 50 | 51 | 52 | def validate_epoch(model, val_loader, device, criterion): 53 | model.eval() 54 | total_loss = 0 55 | y_pred = [] 56 | y_true = [] 57 | n = len(val_loader.dataset) 58 | with torch.no_grad(): 59 | for idx, tiles, labels in val_loader: 60 | # forward 61 | output = model(tiles.to(device)) 62 | 63 | # calculate loss 64 | loss = criterion(input=output, target=labels.long().to(device)) 65 | total_loss += loss.detach().cpu().item() 66 | 67 | # keep track of true and predicted classes 68 | y_pred.extend(output.argmax(1).detach().cpu().reshape(-1).numpy()) 69 | y_true.extend(labels.reshape(-1).numpy()) 70 | 71 | total_loss /= n 72 | acc = accuracy_score(y_true=y_true, y_pred=y_pred) 73 | f1 = f1_score(y_true=y_true, y_pred=y_pred, average='macro') 74 | confusion = confusion_matrix(y_true=y_true, y_pred=y_pred) 75 | return total_loss, acc, f1, confusion 76 | 77 | 78 | def make_preds(model,
loader, device, file_name, n_classes): 79 | header = ['tile_file_name', 'label'] 80 | header.extend(['score_{}'.format(k) for k in range(n_classes)]) 81 | with open(file_name, 'w', newline='') as file: 82 | writer = csv.writer(file, delimiter=',') 83 | writer.writerow(header) 84 | 85 | model.eval() 86 | with torch.no_grad(): 87 | for ids, tiles, labels in loader: 88 | preds = model(tiles.to(device)) 89 | preds = preds.detach().cpu().tolist() 90 | for idx, label, pred_list in zip(ids, labels.tolist(), preds): 91 | row = [idx, label] 92 | row.extend(pred_list) 93 | writer.writerow(row) 94 | 95 | 96 | def serialize(device_id, df, experiment_name, fold): 97 | starting_epoch = 0 98 | device = torch.device('cuda:{}'.format(device_id)) 99 | train_df = df[df.split == 'train'].copy() 100 | assert 'val' not in train_df.split.values  # `in` on a bare Series checks the index, not the values 101 | val_df = df[df.split == 'val'].copy() 102 | assert 'train' not in val_df.split.values 103 | 104 | print('train ({}):'.format(len(train_df))) 105 | print(train_df.tile_class.value_counts()) 106 | print('val ({}):'.format(len(val_df))) 107 | print(val_df.tile_class.value_counts()) 108 | do_validation = len(val_df) > 0 109 | 110 | train_dataset = TissueTileDataset(df=train_df, 111 | tile_dir=config.args.tile_dir, 112 | transforms=get_train_transforms(normalize=True)) 113 | train_loader = DataLoader(train_dataset, 114 | batch_size=config.args.batch_size, 115 | num_workers=8, 116 | pin_memory=False, 117 | shuffle=True) 118 | 119 | if do_validation: 120 | val_dataset = TissueTileDataset(df=val_df, 121 | tile_dir=config.args.tile_dir, 122 | transforms=get_val_transforms(normalize=True)) 123 | val_loader = DataLoader(val_dataset, 124 | batch_size=config.args.batch_size, 125 | num_workers=8) 126 | 127 | model = TissueTileNet(model=get_model(config), n_classes=n_classes) 128 | model.to(device) 129 | 130 | optimizer = Adam(model.parameters(), 131 | lr=config.args.learning_rate, 132 | weight_decay=config.args.weight_decay) 133 | 134 | criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(compute_class_weight( 135 | class_weight='balanced', 136 | classes=np.sort(train_df.tile_class.unique()), 137 | y=train_df.tile_class)).to(device).float()) 138 | 139 | for epoch in range(config.args.num_epochs): 140 | train_loss, train_acc, train_f1, train_confusion = train_epoch(model, 141 | train_loader, 142 | optimizer, 143 | device, 144 | criterion) 145 | if do_validation: 146 | val_loss, val_acc, val_f1, val_confusion = validate_epoch(model, 147 | val_loader, 148 | device, 149 | criterion) 150 | else: 151 | val_loss, val_acc, val_f1, val_confusion = -1, -1, -1, -1 152 | 153 | log_results_string(epoch, starting_epoch, train_loss, val_loss, experiment_name, fold) 154 | results_str = 'training acc {:7.6f} | validation acc {:7.6f}'.format(train_acc, val_acc) 155 | print(results_str) 156 | results_str = 'training f1 {:7.6f} | validation f1 {:7.6f}'.format(train_f1, val_f1) 157 | print(results_str) 158 | print('training:\n' + str(train_confusion)) 159 | print('validation:\n' + str(val_confusion)) 160 | print('---') 161 | torch.save(model.state_dict(), get_checkpoint_path(experiment_name, 162 | epoch, starting_epoch, 163 | fold=fold)) 164 | 165 | 166 | def prep_df(csv_path, tile_dir, map_classes=True): 167 | # load dataframe 168 | df_ = pd.read_csv(csv_path, low_memory=False) 169 | df_ = df_[df_.n_foreground_tiles > 0] 170 | df_.tile_address = df_.tile_address.map(eval) 171 | df_ = df_.explode('tile_address').reset_index() 172 | df_['tile_class'] = df_.tile_address.apply(lambda x: x[1]) 173 |
df_['tile_address'] = df_.tile_address.apply(lambda x: x[0]) 174 | 175 | slideviewer_class_map = {1: 'Stroma', 176 | 2: 'Stroma', 177 | 3: 'Tumor', 178 | 4: 'Tumor', 179 | 5: 'Fat', 180 | 6: 'Vessel', 181 | 7: 'Vessel', 182 | 10: 'Necrosis', 183 | # 11: 'Glass', 184 | 14: 'Pen'} 185 | map_key = {'Stroma': 0, 186 | 'Tumor': 1, 187 | 'Fat': 2, 188 | 'Necrosis': 3} 189 | # 'Vessel': 3, 190 | # 'Necrosis': 4} 191 | # 'Glass': 5, 192 | # 'Pen': 5} 193 | 194 | n_classes = len(map_key) 195 | map_reverse_key = dict([(v, k) for k, v in map_key.items()]) 196 | 197 | #print(df_.tile_class.value_counts()) 198 | if map_classes: 199 | df_['tile_class'] = df_['tile_class'].map(slideviewer_class_map).map(map_key) 200 | df_ = df_[df_['tile_class'].isin(map_key.values())] 201 | #print(df_.tile_class.value_counts()) 202 | #print(df_) 203 | df_['tile_file_name'] = tile_dir + '/' + \ 204 | df_['image_path'].apply(lambda x: x.split('/')[-1].replace('.svs', '')) + '/' + \ 205 | df_['tile_address'].apply(lambda x: str(x).replace('(', '').replace(')', '').replace(', ', '_') + '.png') 206 | return df_, n_classes, map_key, map_reverse_key 207 | 208 | 209 | if __name__ == '__main__': 210 | df_, n_classes, map_key, map_reverse_key = prep_df(config.args.preprocessed_cohort_csv_path, 211 | tile_dir=config.args.tile_dir) 212 | seed = 1123011750 213 | 214 | with open('checkpoints/{}_config.pickle'.format(config.args.experiment_name), 'wb') as f: 215 | pickle.dump(config.args, f) 216 | 217 | if config.args.crossval > 0: 218 | for fold, df in enumerate(general_utils.k_fold_ptwise_crossval(df_, config.args.crossval, seed)): 219 | print('\nFOLD {}'.format(fold)) 220 | serialize(config.args.gpu[0], df, config.args.experiment_name, fold) 221 | elif config.args.crossval == 0: 222 | print('warning: no validation set') 223 | df_.loc[:, 'split'] = 'train' 224 | serialize(config.args.gpu[0], df_, config.args.experiment_name, -2) 225 | -------------------------------------------------------------------------------- /tissue-type-training/visualizations/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/visualizations/.gitkeep --------------------------------------------------------------------------------
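One convention ties the stages above together: the tile address tuple chosen in `preprocess.py` becomes the tile's file name in `pretile_slide`, and `prep_df` rebuilds the same name when constructing `tile_file_name`. A minimal illustration of the shared transformation:

```python
# address tuple -> file name, exactly as in pretile_slide and prep_df above
address = (12, 34)  # hypothetical (column, row) tile address
tile_file = str(address).replace('(', '').replace(')', '').replace(', ', '_') + '.png'
print(tile_file)  # '12_34.png'
```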