├── README.md
├── ct-feature-extraction
│   ├── environment.yml
│   ├── extract_ct_features.py
│   ├── extract_ct_features.sh
│   ├── features
│   │   ├── .gitkeep
│   │   ├── ct_features_omentum.csv
│   │   └── ct_features_ovary.csv
│   ├── make_windowed_vols.py
│   ├── params_left_ovary25.yaml
│   ├── params_omentum25.yaml
│   ├── params_right_ovary25.yaml
│   └── process_dataframes.py
├── feature-selection
│   ├── environment.yml
│   ├── results
│   │   ├── .gitkeep
│   │   ├── hr_ct_features_omentum.csv
│   │   ├── hr_ct_features_ovary.csv
│   │   └── hr_hne_features.csv
│   ├── select_features.py
│   └── select_features.sh
├── global_config.yaml
├── hne-feature-extraction
│   ├── 0_get_cohort_csv.sh
│   ├── 1_infer_tissue_types_and_extract_features.sh
│   ├── 1_infer_tissue_types_and_extract_features.sub
│   ├── 2_extract_objects.sh
│   ├── 2_extract_objects.sub
│   ├── 3_label_objects_and_extract_features.sh
│   ├── 3_label_objects_and_extract_features.sub
│   ├── _run_gpu_qupath_stardist_singleSlide.sh
│   ├── bitmaps
│   │   └── .gitkeep
│   ├── connector.py
│   ├── environment.yml
│   ├── extract_feats_from_bitmaps.py
│   ├── extract_feats_from_object_detections.py
│   ├── final_objects
│   │   └── .gitkeep
│   ├── hne-feature-extraction.dag
│   ├── infer_tissue_tile_clf.py
│   ├── inference
│   │   └── .gitkeep
│   ├── map_inference_to_bitmap.py
│   ├── merge_cells_and_regions.py
│   ├── qupath
│   │   ├── Makefile
│   │   ├── README.md
│   │   ├── data
│   │   │   ├── results
│   │   │   │   └── .gitkeep
│   │   │   └── slides
│   │   │       └── .gitkeep
│   │   ├── detections
│   │   │   └── CMU-1-Small-Region_2_stardist_detections_and_measurements.tsv
│   │   ├── docker-compose.yml
│   │   ├── dockerfile
│   │   ├── init_singularity_env.sh
│   │   ├── models
│   │   │   ├── ANN_StardistSeg3.0CellExp1.0CellConstraint_AllFeatures_LymphClassifier.json
│   │   │   └── he_heavy_augment
│   │   │       ├── saved_model.pb
│   │   │       └── variables
│   │   │           ├── variables.data-00000-of-00001
│   │   │           └── variables.index
│   │   └── scripts
│   │       └── stardist_nuclei_and_lymphocytes.groovy
│   ├── tissue_tile_features
│   │   └── reference_hne_features.csv
│   └── visualizations
│       └── .gitkeep
├── license.md
├── survival-modeling
│   ├── environment.yml
│   ├── figures
│   │   ├── .gitkeep
│   │   ├── barplots
│   │   │   └── .gitkeep
│   │   ├── crs_plots
│   │   │   └── .gitkeep
│   │   ├── feature_plots
│   │   │   └── .gitkeep
│   │   ├── forest_plots
│   │   │   └── .gitkeep
│   │   ├── km_plots
│   │   │   └── .gitkeep
│   │   └── multimodal
│   │       └── .gitkeep
│   ├── results
│   │   ├── .gitkeep
│   │   ├── crs
│   │   │   └── .gitkeep
│   │   └── model_summaries
│   │       └── .gitkeep
│   ├── train_test.py
│   └── utils.py
└── tissue-type-training
    ├── checkpoints
    │   ├── .gitkeep
    │   └── tissue_type_classifier_weights.torch
    ├── config.py
    ├── confusion_matrix_analysis.py
    ├── cross_validate_on_annotations.sh
    ├── dataset.py
    ├── environment.yml
    ├── eval_tissue_tile.py
    ├── evals
    │   └── .gitkeep
    ├── general_utils.py
    ├── models.py
    ├── pred_tissue_tile.py
    ├── predictions
    │   └── .gitkeep
    ├── preprocess.py
    ├── pretile.py
    ├── pretilings
    │   └── .gitkeep
    ├── train_on_all_annotations.sh
    ├── train_tissue_tile_clf.py
    └── visualizations
        └── .gitkeep
/README.md:
--------------------------------------------------------------------------------
1 | # OncoFusion
2 | This software extracts features from histopathologic whole-slide images, contrast-enhanced computed tomography, targeted sequencing panels, and clinical covariates, and subsequently integrates them using a late-fusion machine learning model to stratify patients by overall survival. This repository accompanies [Multimodal data integration using machine learning improves risk stratification of high-grade serous ovarian cancer](https://www.nature.com/articles/s43018-022-00388-9).
3 |
4 | ## Requirements
5 |
6 | Hardware: Tested on a server with 96 CPUs, 500 GB CPU RAM, 4 GPUs (Tesla V100, CUDA 11.4), 64 GB GPU RAM, and 1 TB storage.
7 |
8 | Software: Tested on Red Hat Enterprise Linux v7.8 with Python v3.9, Conda v4.12, Singularity v3.8.3, and the conda environments specified in the environment.yml files in each sub-directory.
9 |
10 | ## Set up
11 |
12 | ### Download Synapse repository
13 | https://www.synapse.org/#!Synapse:syn25946117/wiki/611576
14 |
15 | ### Download H&E WSIs
16 | Download the H&E WSIs listed within the downloaded Synapse repository at `data/hne/tcga/manifest.txt` using the GDC Data Transfer Tool (https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/). Ensure a flat file structure (.svs files directly within the `data/hne/tcga` folder).
17 |
18 | ### Clone this GitHub repository
19 | It is recommended to clone this GitHub repository into the same directory as the Synapse repository. Conda environments are provided as `environment.yml` files for each stage of the pipeline.
20 |
21 | ### Set global parameters
22 | In `global_config.yaml`, set the full paths to the directories enclosing the data and code. All scripts assume that the code and data are within subdirectories of these paths, entitled `code` and `data` respectively.
23 |
24 | ### Move Singularity image to code repository
25 | Move `qupath-stardist_latest.sif` from `data` to `code/hne-feature-extraction/qupath`.
26 |
27 | ## Tissue type training
28 | Using annotations by gynecologic pathologists (found within the `tissue-type-training` directory of the Synapse repository), train a semantic segmentation model to infer tissue type from H&E images. This component is optional: the weights resulting from our training are already stored in `tissue-type-training/checkpoints/tissue_type_classifier_weights.torch`. Other than the paths set in the global YAML file in the previous step, all options are set in `config.py`. For help, use `python config.py --help`.
29 |
30 | ### Cross-validate model for tissue type inference
31 | `tissue-type-training/cross_validate_on_annotations.sh`
32 | Use this to explore various model types and hyperparameter configurations.
33 |
34 | ### Train model for tissue type inference
35 | `tissue-type-training/train_on_all_annotations.sh` Note that `preprocess.py` and `pretile.py` must be run before this step; the cross-validation script runs both, so running it first suffices.
36 |
37 | ## H&E feature extraction
38 |
39 | ### Extract tissue type features
40 | Next, we apply our trained model to semantically segment tissue types on slides from our multimodal patient cohort: `hne-feature-extraction/1_infer_tissue_types_and_extract_features.sh`. This is the process that ultimately generates the tissue type-based features in `hne-feature-extraction/tissue_tile_features/reference_hne_features.csv`.
41 |
42 | ### Identify nuclei
43 | Using the StarDist extension for QuPath, we perform instance segmentation of cellular nuclei and apply a bespoke classification script to distinguish lymphocytes from other nuclei: `hne-feature-extraction/2_extract_objects.sh`. Before running this script, move or copy slides of interest from `data/hne` to `code/hne-feature-extraction/qupath/data/slides`.
44 |
45 | ### Label nuclei by tissue type; extract nuclear features
46 | Finally, we coregister the two feature spaces and extract descriptive statistics for nuclei of each cell type: `hne-feature-extraction/3_label_objects_and_extract_features.sh`. This is the process that ultimately generates the nuclear features in `reference_hne_features.csv`.
47 |
48 |
49 | ## CT feature extraction
50 | We apply the abdominal window and extract features from omental and adnexal lesions contoured by fellowship-trained diagnostic radiologists: `ct-feature-extraction/extract_ct_features.sh`. Features are stored as CSV files in the `features` subdirectory.
51 |
52 |
53 | ## Feature selection
54 | We use log partial hazard ratios and their associated significance, calculated by univariate Cox regression, to select informative features from the CT and H&E feature spaces: `feature-selection/select_features.sh`. The log partial hazard ratio for each feature across the training cohort and the associated volcano plot are generated for each modality in the `results` subdirectory.
55 |
56 |
57 | ## Survival modeling
58 | Use the conda environment specified in `feature-selection/environment.yml`.
59 | We build univariate survival models for the histopathologic, radiologic, clinical, and genomic information spaces. Subsequently, we combine the modalities in a late-fusion framework and plot the performance: `survival-modeling/train_test.py`. Relevant results and figures are generated in the respective subdirectories.
60 |
--------------------------------------------------------------------------------
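As a quick sanity check after the setup steps above, the expected layout can be verified before launching any pipeline stage. The following is a hypothetical helper, not part of the repository; it assumes the recommended layout in which `global_config.yaml` points at the directories enclosing `code` and `data`:

```python
# Hypothetical setup check (not part of the repository).
import os
import yaml

with open('code/global_config.yaml') as f:  # assumes the repo was cloned into `code`
    cfg = yaml.safe_load(f)

# Paths referenced by the pipeline stages in the README.
checks = [(cfg['data_dir'], 'data/hne/tcga'),
          (cfg['data_dir'], 'data/dataframes'),
          (cfg['code_dir'], 'code/hne-feature-extraction/qupath/qupath-stardist_latest.sif')]
for base, rel in checks:
    full = os.path.join(base, rel)
    print(full, 'OK' if os.path.exists(full) else 'MISSING')
```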
/ct-feature-extraction/environment.yml:
--------------------------------------------------------------------------------
1 | name: pyrad
2 | channels:
3 | - conda-forge
4 | - anaconda
5 | - defaults
6 | dependencies:
7 | - _libgcc_mutex=0.1=main
8 | - _openmp_mutex=4.5=1_gnu
9 | - blas=1.0=mkl
10 | - ca-certificates=2022.3.29=h06a4308_1
11 | - certifi=2021.5.30=py36h06a4308_0
12 | - cudatoolkit=10.1.243=h6bb024c_0
13 | - cycler=0.11.0=pyhd3eb1b0_0
14 | - dbus=1.13.18=hb2f20db_0
15 | - expat=2.4.4=h295c915_0
16 | - fontconfig=2.13.1=h6c09931_0
17 | - freetype=2.11.0=h70c0345_0
18 | - future=0.18.2=py36_1
19 | - glib=2.63.1=h5a9c865_0
20 | - gst-plugins-base=1.14.0=hbbd80ab_1
21 | - gstreamer=1.14.0=hb453b48_1
22 | - icu=58.2=he6710b0_3
23 | - intel-openmp=2022.0.1=h06a4308_3633
24 | - joblib=1.0.1=pyhd3eb1b0_0
25 | - jpeg=9d=h7f8727e_0
26 | - kiwisolver=1.3.1=py36h2531618_0
27 | - ld_impl_linux-64=2.35.1=h7274673_9
28 | - libffi=3.2.1=hf484d3e_1007
29 | - libgcc-ng=9.3.0=h5101ec6_17
30 | - libgfortran-ng=7.5.0=ha8ba4b0_17
31 | - libgfortran4=7.5.0=ha8ba4b0_17
32 | - libgomp=9.3.0=h5101ec6_17
33 | - libpng=1.6.37=hbc83047_0
34 | - libstdcxx-ng=9.3.0=hd4cf53a_17
35 | - libuuid=1.0.3=h7f8727e_2
36 | - libxcb=1.14=h7b6447c_0
37 | - libxml2=2.9.12=h03d6c58_0
38 | - matplotlib=3.1.3=py36_0
39 | - matplotlib-base=3.1.3=py36hef1b27d_0
40 | - mkl=2020.2=256
41 | - mkl-service=2.3.0=py36he8ac12f_0
42 | - mkl_fft=1.3.0=py36h54f3939_0
43 | - mkl_random=1.1.1=py36h0573a6f_0
44 | - ncurses=6.3=h7f8727e_2
45 | - numpy-base=1.18.5=py36hde5b4d6_0
46 | - numpy-indexed=0.3.5=py_1
47 | - openssl=1.1.1n=h7f8727e_0
48 | - pandas=1.0.3=py36h0573a6f_0
49 | - pcre=8.45=h295c915_0
50 | - pip=21.2.2=py36h06a4308_0
51 | - pydicom=2.1.2=pyhd3deb0d_0
52 | - pyparsing=3.0.4=pyhd3eb1b0_0
53 | - pyqt=5.9.2=py36h05f1152_2
54 | - python=3.6.10=hcf32534_1
55 | - python-dateutil=2.8.2=pyhd3eb1b0_0
56 | - pytz=2021.3=pyhd3eb1b0_0
57 | - qt=5.9.7=h5867ecd_1
58 | - readline=8.1.2=h7f8727e_1
59 | - scikit-learn=0.22.1=py36hd81dba3_0
60 | - scipy=1.5.2=py36h0b6359f_0
61 | - setuptools=58.0.4=py36h06a4308_0
62 | - sip=4.19.8=py36hf484d3e_0
63 | - six=1.16.0=pyhd3eb1b0_1
64 | - sqlite=3.38.2=hc218d9a_0
65 | - tk=8.6.11=h1ccaba5_0
66 | - tornado=6.1=py36h27cfd23_0
67 | - wheel=0.37.1=pyhd3eb1b0_0
68 | - xlrd=1.2.0=py36_0
69 | - xz=5.2.5=h7b6447c_0
70 | - zlib=1.2.12=h7f8727e_2
71 | - pip:
72 | - docopt==0.6.2
73 | - medpy==0.4.0
74 | - numpy==1.18.2
75 | - pykwalify==1.7.0
76 | - pyradiomics==3.0.1
77 | - pywavelets==1.0.0
78 | - pyyaml==5.3.1
79 | - simpleitk==1.2.4
80 |
--------------------------------------------------------------------------------
/ct-feature-extraction/extract_ct_features.py:
--------------------------------------------------------------------------------
1 | """
2 | extract_ct_features.py extracts PyRadiomics-defined features from windowed CT volumes.
3 | """
4 |
5 | import pandas as pd
6 | import numpy as np
7 | import os
8 | import logging
9 | import yaml
10 |
11 | from joblib import Parallel, delayed
12 | from radiomics import featureextractor
13 |
14 |
15 | def define_all_lesions_for_one_site(df, params_fn, results_fn):
16 | print('-'*32)
17 | print(params_fn)
18 |
19 | all_results = Parallel(n_jobs=16)(delayed(extract_single)(params_fn, row) for _, row in df.iterrows())
20 | #all_results = []
21 | #for _, row in df.iterrows():
22 | # all_results.append(extract_single(params_fn, row))
23 | results = pd.DataFrame(all_results)
24 | results['Patient ID'] = results['Patient ID'].astype(str).apply(lambda x: x.zfill(3))
25 | results.to_csv(results_fn, index=False)
26 |
27 |
28 | def extract_single(params_fn, row):
29 | extractor = featureextractor.RadiomicsFeatureExtractor(params_fn)
30 | logger = logging.getLogger("radiomics")
31 | logger.setLevel(logging.ERROR)
32 | try:
33 | result = extractor.execute(os.path.join(DATA_DIR, row.windowed_image_path),
34 | os.path.join(DATA_DIR, row.segmentation_path))
35 | print('Extracted features successfully for {}'.format(row.windowed_image_path))
36 | except ValueError:
37 | result = {}
38 | print('WARNING: ValueError for {}'.format(row.windowed_image_path))
39 | except RuntimeError:
40 | result = {}
41 | print('WARNING: RuntimeError for {}'.format(row.windowed_image_path))
42 |
43 | result.update({'Patient ID': str(row['Patient ID'])})
44 | return result
45 |
46 |
47 | if __name__ == '__main__':
48 | with open('../global_config.yaml', 'r') as f:
49 | CONFIGS = yaml.safe_load(f)
50 | DATA_DIR = CONFIGS['data_dir']
51 |
52 | INPUT_DATAFRAME_PATH = os.path.join(DATA_DIR, 'data/dataframes/ct_df.csv')
53 | DF = pd.read_csv(INPUT_DATAFRAME_PATH)
54 | DF['Patient ID'] = DF['Patient ID'].astype(str)
55 | BINS = [25]
56 |
57 | for B in BINS:
58 | print('Extracting features for bin size {}'.format(B))
59 |
60 | define_all_lesions_for_one_site(DF,
61 | 'params_left_ovary{}.yaml'.format(B),
62 | 'features/_ct_left_ovary_bin{}.csv'.format(B))
63 |
64 | define_all_lesions_for_one_site(DF,
65 | 'params_right_ovary{}.yaml'.format(B),
66 | 'features/_ct_right_ovary_bin{}.csv'.format(B))
67 |
68 | define_all_lesions_for_one_site(DF,
69 | 'params_omentum{}.yaml'.format(B),
70 | 'features/_ct_omentum_bin{}.csv'.format(B))
71 |
72 |
--------------------------------------------------------------------------------
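For orientation, the core PyRadiomics call that `extract_single` parallelizes can be reproduced standalone. This is a minimal sketch: the volume and mask filenames are placeholders, and the parameter file is one of those shipped in this directory:

```python
# Minimal standalone version of the per-lesion extraction above.
# 'windowed_ct.mhd' and 'segmentation.mhd' are placeholder filenames.
from radiomics import featureextractor

extractor = featureextractor.RadiomicsFeatureExtractor('params_omentum25.yaml')
result = extractor.execute('windowed_ct.mhd', 'segmentation.mhd')

# Keep only the wavelet features that process_dataframes.py later retains.
wavelet = {k: v for k, v in result.items() if k.startswith('wavelet')}
print('{} wavelet features extracted'.format(len(wavelet)))
```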
/ct-feature-extraction/extract_ct_features.sh:
--------------------------------------------------------------------------------
1 | python make_windowed_vols.py
2 | python extract_ct_features.py
3 | python process_dataframes.py
4 |
--------------------------------------------------------------------------------
/ct-feature-extraction/features/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/ct-feature-extraction/features/.gitkeep
--------------------------------------------------------------------------------
/ct-feature-extraction/make_windowed_vols.py:
--------------------------------------------------------------------------------
1 | """
2 | make_windowed_vols.py applies the abdominal window to the raw MHD files in data/dataframes/ct_df.csv
3 | and saves the resulting MHD files in data/ct/windowed_scans. TCGA MHD files must be downloaded before
4 | running.
5 | """
6 | import pandas as pd
7 | import numpy as np
8 | import os
9 | import yaml
10 |
11 | from medpy.io import load, save
12 | from joblib import delayed, Parallel
13 |
14 |
15 | with open('../global_config.yaml', 'r') as f:
16 | DATA_DIR = yaml.safe_load(f)['data_dir']
17 |
18 | LEVEL = 50
19 | WIDTH = 400
20 | LOWER_BOUND = LEVEL - WIDTH//2
21 | UPPER_BOUND = LEVEL + WIDTH//2
22 | INPUT_DATAFRAME_PATH = os.path.join(DATA_DIR, 'data/dataframes/ct_df.csv')
23 |
24 |
25 | def make_windowed(row):
26 | """
27 | Given row of CT data frame, load MHD file, apply window, and save windowed version.
28 | """
29 | input_tumor_img_fn = os.path.join(DATA_DIR, row['image_path'])
30 | output_tumor_img_fn = os.path.join(DATA_DIR, row['windowed_image_path'])
31 | try:
32 | tumor_img, header = load(input_tumor_img_fn)
33 | tumor_img = np.clip(tumor_img, a_min=LOWER_BOUND, a_max=UPPER_BOUND)
34 | sub_dir = '/'.join(output_tumor_img_fn.split('/')[:-1])
35 | # exist_ok avoids a race between parallel workers creating the same directory
36 | os.makedirs(sub_dir, exist_ok=True)
37 | save(tumor_img, output_tumor_img_fn, header)
38 | print('{} succeeded'.format(input_tumor_img_fn))
39 | except Exception:
40 | print('{} failed'.format(input_tumor_img_fn))
41 |
42 |
43 | if __name__ == '__main__':
44 | df = pd.read_csv(INPUT_DATAFRAME_PATH)
45 | if not os.path.exists(os.path.join(DATA_DIR, 'data/ct/windowed_scans')):
46 | os.mkdir(os.path.join(DATA_DIR, 'data/ct/windowed_scans'))
47 | Parallel(n_jobs=16)(delayed(make_windowed)(row) for idx, row in df.iterrows())
48 | #for idx, row in df.iterrows():
49 | # make_windowed(row)
50 |
--------------------------------------------------------------------------------
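The abdominal window applied above is a plain intensity clip: with LEVEL = 50 and WIDTH = 400, the bounds work out to [-150, 250] HU. A toy check of the arithmetic:

```python
import numpy as np

level, width = 50, 400
lower, upper = level - width // 2, level + width // 2  # -150, 250 HU

hu = np.array([-1000, -150, 40, 250, 1200])  # example voxel intensities
print(np.clip(hu, a_min=lower, a_max=upper))  # [-150 -150   40  250  250]
```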
/ct-feature-extraction/params_left_ovary25.yaml:
--------------------------------------------------------------------------------
1 | setting:
2 | label: 46
3 | resampledPixelSpacing: [1.0, 1.0, 1.0]
4 | binWidth: 25
5 | interpolator: 'sitkBSpline'
6 |
7 | imageType:
8 | Wavelet: {}
9 |
10 |
--------------------------------------------------------------------------------
/ct-feature-extraction/params_omentum25.yaml:
--------------------------------------------------------------------------------
1 | setting:
2 | label: 43
3 | resampledPixelSpacing: [1.0, 1.0, 1.0]
4 | binWidth: 25
5 | interpolator: 'sitkBSpline'
6 |
7 | imageType:
8 | Wavelet: {}
9 |
10 |
--------------------------------------------------------------------------------
/ct-feature-extraction/params_right_ovary25.yaml:
--------------------------------------------------------------------------------
1 | setting:
2 | label: 45
3 | resampledPixelSpacing: [1.0, 1.0, 1.0]
4 | binWidth: 25
5 | interpolator: 'sitkBSpline'
6 |
7 | imageType:
8 | Wavelet: {}
9 |
10 |
--------------------------------------------------------------------------------
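The three `params_*25.yaml` files are identical apart from `setting.label`, which selects the contour of interest in the segmentation volume (judging by the file names, 46 = left ovary, 45 = right ovary, 43 = omentum). A quick way to confirm this, assuming the files are in the working directory:

```python
import yaml

for fn in ['params_left_ovary25.yaml', 'params_right_ovary25.yaml',
           'params_omentum25.yaml']:
    with open(fn) as f:
        params = yaml.safe_load(f)
    print(fn, params['setting']['label'])  # 46, 45, 43; all other keys match
```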
/ct-feature-extraction/process_dataframes.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 |
3 | def make_df(input_filenames):
4 | # load & concatenate if necessary
5 | if type(input_filenames) is str:
6 | input_filenames = [input_filenames]
7 |
8 | dfs = [pd.read_csv(input_filename, engine='python') for input_filename in input_filenames]
9 | df = pd.concat(dfs)
10 |
11 | # drop NaNs and duplicates (for ovarian)
12 | df = df.dropna(how='any')
13 | df = df.drop_duplicates(subset=['Patient ID'])
14 |
15 | # remove extraneous columns
16 | df = df.set_index('Patient ID').filter(regex='^wavelet', axis=1)
17 | df = df.reset_index()
18 |
19 | return df
20 |
21 |
22 | if __name__ == '__main__':
23 | omentum_df = make_df('features/_ct_omentum_bin25.csv')
24 | omentum_df.to_csv('features/ct_features_omentum.csv', index=False)
25 |
26 | ovary_df = make_df(['features/_ct_left_ovary_bin25.csv',
27 | 'features/_ct_right_ovary_bin25.csv'])
28 | ovary_df.to_csv('features/ct_features_ovary.csv', index=False)
29 |
--------------------------------------------------------------------------------
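The `filter(regex='^wavelet', axis=1)` step is what discards PyRadiomics diagnostic columns and anything else that is not a wavelet feature; a toy illustration of the behavior:

```python
import pandas as pd

df = pd.DataFrame({'Patient ID': ['001'],
                   'diagnostics_Versions_PyRadiomics': ['3.0.1'],
                   'wavelet-LLH_glcm_Contrast': [0.42]})
kept = df.set_index('Patient ID').filter(regex='^wavelet', axis=1)
print(kept.columns.tolist())  # ['wavelet-LLH_glcm_Contrast']
```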
/feature-selection/environment.yml:
--------------------------------------------------------------------------------
1 | name: sklearn
2 | channels:
3 | - conda-forge
4 | - bioconda
5 | - defaults
6 | dependencies:
7 | - scikit-learn
8 | - pandas
9 | - lifelines
10 | - seaborn
11 | - xlrd
12 | - openpyxl
13 | - rasterio
14 | - shapely
15 | - albumentations
16 | - requests
17 | - bokeh
18 | - libtiff
19 | - colorcet
20 | - holoviews
21 | - pingouin
22 | - pip
23 | - pip:
24 | - statannot
25 | prefix: /Users/boehmk/anaconda3/envs/sklearn
26 |
--------------------------------------------------------------------------------
/feature-selection/results/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/feature-selection/results/.gitkeep
--------------------------------------------------------------------------------
/feature-selection/select_features.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import seaborn as sns
4 | import matplotlib.pyplot as plt
5 | import lifelines
6 | import yaml
7 | import os
8 | import warnings
9 |
10 | from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
11 | from scipy.stats import spearmanr, pearsonr, kendalltau, zscore, percentileofscore, chisquare, power_divergence
12 | from statsmodels.stats.outliers_influence import variance_inflation_factor
13 | from pingouin import partial_corr
14 | from lifelines import CoxPHFitter
15 | from lifelines.utils import concordance_index
16 | from joblib import Parallel, delayed
17 | from argparse import ArgumentParser
18 | from sklearn.cluster import KMeans
19 | from statsmodels.stats.multitest import multipletests
20 |
21 | warnings.simplefilter("ignore")
22 |
23 | FONTSIZE = 7
24 | plt.rc('legend',fontsize=FONTSIZE, title_fontsize=FONTSIZE)
25 | plt.rc('xtick',labelsize=FONTSIZE)
26 | plt.rc('ytick',labelsize=FONTSIZE)
27 | plt.rc("axes", labelsize=FONTSIZE)
28 |
29 | def evaluate_feature_partial_correlation(df, feat_col, y_col, covariate_col, x_covariate_col, y_covariate_col, method):
30 | """
31 | :param df: pandas DataFrame with each row being a patient and each column being a feature or outcome, containing column "duration" (float) of survival time
32 | :param feat_col: column name (str) of feature values (float)
33 | :param y_col: column name (str) of outcomes (float)
34 | :param covariate_col: column name (str) of XY covariate (float)
35 | :param x_covariate_col: column name (str) of X covariate (float)
36 | :param y_covariate_col: column name (str) of Y covariate (float)
37 | :param method: name (str) of method supported by pingouin.partial_corr
38 | :return: single-entry dict with {feat_col (str): [p (float), corr (float)]}
39 | """
40 | results = partial_corr(data=df,
41 | x=feat_col,
42 | y=y_col,
43 | y_covar=y_covariate_col,
44 | x_covar=x_covariate_col,
45 | covar=covariate_col,
46 | method=method)
47 | corr = results['r'].item()
48 | p = results['p-val'].item()
49 | if np.isnan(p):
50 | p = 1.0
51 | corr = 0.0
52 | return {feat_col: [p, corr]}
53 |
54 |
55 | def evaluate_feature_cox(df, feat_col, covar):
56 | """
57 | :param df: pandas DataFrame with each row being a patient and each column being a feature or outcome, containing columns "duration" (float) of survival time and "observed" (float) to delineate observed [1.0] or censored [0.0] outcomes
58 | :param feat_col: column name (str) of feature values (float)
59 | :param covar: column name (str) of XY covariate (float)
60 | :return: single-entry dict with {feat_col (str): [p (float), log(partial_hazard_ratio) (float)]}
61 | """
62 |
63 | col_list = ['duration', 'observed', feat_col]
64 | if covar:
65 | col_list.append(covar)
66 |
67 | model = CoxPHFitter(penalizer=0.0)
68 | try:
69 | model.fit(df[col_list], duration_col='duration', event_col='observed')
70 | except (lifelines.exceptions.ConvergenceError, lifelines.exceptions.ConvergenceWarning) as e:
71 | try:
72 | model = CoxPHFitter(penalizer=0.2)
73 | model.fit(df[col_list], duration_col='duration', event_col='observed')
74 | except (lifelines.exceptions.ConvergenceError, lifelines.exceptions.ConvergenceWarning) as e:
75 | return {feat_col: [1.0, 0.0]}
76 |
77 | coef = model.summary.coef[feat_col]
78 | p = model.summary.p[feat_col]
79 |
80 | return {feat_col: [p, coef]}
81 |
82 |
83 | def evaluate_feature_concordance(df, feat_col, k_permutations=100):
84 | """
85 | :param df: pandas DataFrame with each row being a patient and each column being a feature or outcome, containing columns "duration" (float) of survival time and "observed" (float) to delineate observed [1.0] or censored [0.0] outcomes
86 | :param feat_col: column name (str) of feature values (float)
87 | :param k_permutations: number of iterations (int) for permutation test to assess statistical significance
88 | :return: single-entry dict with {feat_col (str): [p (float), concordance deviation from 0.5 (float)]}
89 | """
90 |
91 | c = lifelines.utils.concordance_index(event_times=df['duration'],
92 | predicted_scores=df[feat_col],
93 | event_observed=df['observed'])
94 | directional_deviance = c - 0.5
95 | absolute_deviance = np.abs(directional_deviance)
96 |
97 | random_absolute_deviances = []
98 | for _ in range(k_permutations):
99 | random_c = lifelines.utils.concordance_index(event_times=df['duration'],
100 | predicted_scores=df[feat_col].sample(frac=1),
101 | event_observed=df['observed'])
102 | random_absolute_deviances.append(np.abs(random_c - 0.5))
103 | p = (np.array(random_absolute_deviances) >= absolute_deviance).mean()
104 | return {feat_col: [p, directional_deviance]}
105 |
106 |
107 | def evaluate_features(df, feats_to_consider, method='cph', covar=None, x_covar=None, y_covar=None, n_jobs=-1):
108 | """
109 | :param df: pandas DataFrame with each row being a patient and each column being a feature or outcome
110 | :param feats_to_consider: list of column names (str) identifying feature columns
111 | :param method: name (str) of method supported by pingouin.partial_corr, "cph" for Cox regression, or "c-index" for concordance assessment
112 | :param covar: column name (str) of XY covariate (float)
113 | :param x_covar: column name (str) of X covariate (float)
114 | :param y_covar: column name (str) of Y covariate (float)
115 | :param n_jobs: number of parallel jobs to run for feature assessment
116 | :return: pandas DataFrame with columns ['feat', 'p', 'stat']
117 | """
118 | assert not df.isna().sum().any()
119 | assert ('duration' in df.columns) and ('observed' in df.columns)
120 |
121 | if method == 'cph':
122 | assert (not x_covar) and (not y_covar)
123 | dicts = Parallel(n_jobs=n_jobs)(delayed(evaluate_feature_cox)
124 | (df, effect, covar) for effect in feats_to_consider)
125 | elif method == 'c-index':
126 | assert (not covar) and (not x_covar) and (not y_covar)
127 | dicts = Parallel(n_jobs=n_jobs)(delayed(evaluate_feature_concordance)
128 | (df, effect) for effect in feats_to_consider)
129 | elif method in ['pearson', 'spearman']:
130 | assert (not (covar and x_covar)) and (not (covar and y_covar))
131 | dicts = Parallel(n_jobs=n_jobs)(delayed(evaluate_feature_partial_correlation)
132 | (df[df.observed.astype(bool)], effect, 'duration', covar, x_covar, y_covar, method) for effect in feats_to_consider)
133 | else:
134 | raise RuntimeError("Unknown method {}".format(method))
135 |
136 | results = {}
137 | for dict_ in dicts:
138 | results.update(dict_)
139 | results = list(results.items())
140 |
141 | feats = np.array([x[0] for x in results])
142 | p_values = np.array([x[1][0] for x in results])
143 | stat_values = np.array([x[1][1] for x in results])
144 | results_df = pd.DataFrame({'feat': feats, 'p': p_values, 'stat': stat_values})
145 |
146 | if method == 'response-agnostic':
147 | results_df['p'] = results_df.stat.apply(lambda x: (100 - percentileofscore(results_df.stat, x))/100)
148 | results_df = results_df.sort_values(by='p')
149 |
150 | return results_df
151 |
152 |
153 | def _get_features_to_consider(all_columns, args):
154 | """
155 | :param all_columns: list of columns in input dataframe
156 | :param args: parsed arguments
157 | :return: list of actual feature-containing columns (excluding covariates and outcomes)
158 | """
159 | columns_to_exclude = ['duration', 'observed', 'Unnamed: 0']
160 | for col in [args.xy_covar, args.x_covar, args.y_covar, args.index_col]:
161 | if col:
162 | columns_to_exclude.append(col)
163 | return list(set(all_columns) - set(columns_to_exclude))
164 |
165 |
166 | def preprocess_features(df, outlier_threshold, feature_names=None, scaler=None):
167 | """
168 | :param df: input dataframe with features
169 | :param outlier_threshold: number of std devs beyond which a value is treated as an outlier (-1 disables outlier handling)
170 | :param feature_names: list of feature names (str); :param scaler: optional previously fit scaler (a MinMaxScaler is fit if None)
171 | :return: tuple of (dataframe with outliers imputed to the median, then min-max scaled; the scaler used)
172 | """
173 | if feature_names:
174 | x = df[feature_names].copy(deep=True)
175 | y = df.drop(columns=feature_names).copy(deep=True)
176 | else:
177 | x = df
178 |
179 | # x[(zscore(x.values.astype(float))) < -outlier_threshold] = -outlier_threshold
180 | # x[(zscore(x.values.astype(float))) > outlier_threshold] = outlier_threshold
181 | if outlier_threshold != -1:
182 | x[(np.abs(zscore(x.values.astype(float))) > outlier_threshold)] = np.nan
183 | x = x.fillna(x.median())
184 | if not scaler:
185 | scaler = MinMaxScaler()
186 | # scaler = StandardScaler()
187 | # scaler = RobustScaler()
188 | x = pd.DataFrame(scaler.fit_transform(x.values), columns=x.columns, index=x.index)
189 | else:
190 | x = pd.DataFrame(scaler.transform(x.values), columns=x.columns, index=x.index)
191 | # scaler = RobustScaler()
192 |
193 | if outlier_threshold != -1:
194 | x[x>outlier_threshold]=outlier_threshold
195 | x[x<-outlier_threshold]=-outlier_threshold
196 | else:
197 | x[x>5]=5
198 | x[x<-5]=-5
199 |
200 | if feature_names:
201 | df = x.join(y)
202 | else:
203 | df = x
204 | return df, scaler
205 |
206 |
207 | def _get_x_axis_name_volcano(method):
208 | if method == 'kendall':
209 | x_axis_name = "Kendall's $\\tau$"
210 | elif method == 'spearman':
211 | x_axis_name = "Spearman's $\\rho$"
212 | elif method == 'pearson':
213 | x_axis_name = "Pearson's $\\rho$"
214 | elif method == 'response-agnostic':
215 | x_axis_name = "IQR"
216 | elif method == 'cph':
217 | x_axis_name = "log(Hazard ratio)"
218 | elif method == 'c-index':
219 | x_axis_name = "concordance (deviation from random)"
220 | else:
221 | raise RuntimeError("Cannot generate volcano plot x-axis name for method {}".format(method))
222 | return x_axis_name
223 |
224 |
225 | def _make_results_df_pretty_for_plotting(df_, x_axis_name, modality, eps=1e-30):
226 | df_.loc[df_.p < eps, 'p'] = eps
227 | df = pd.DataFrame({'feature': df_.feat,
228 | '-log(p)': -np.log10(df_.p),
229 | x_axis_name: df_.stat})
230 | if modality == 'radiology':
231 | try:
232 | for feat_name in df.feature:
233 | assert 'original-' not in feat_name
234 | assert 'diagnostic' not in feat_name
235 | except AssertionError:
236 | raise RuntimeError("Feature {} appears not to be a wavelet feature. The volcano plot color coding only supports wavelet features.".format(feat_name))
237 |
238 | df['abbreviated_feature'] = df.feature.str.replace('wavelet-', '')
239 | df['abbreviated_feature'] = df.abbreviated_feature.str.replace('log-sigma-1-0-mm-3D_', 'LoG_')
240 | df['abbreviated_feature'] = df.abbreviated_feature.str.replace('log-sigma-3-0-mm-3D_', 'LoG_')
241 |
242 | df['abbreviated_feature'] = df['abbreviated_feature'].apply(lambda x: '_'.join([x.split('_')[0], x.split('_')[-1]]))
243 |
244 | df['Matrix'] = 'Other'  # column assignment; attribute-style df.Matrix would not create a column
245 | for matrix in ['glszm', 'ngtdm', 'glcm', 'glrlm', 'gldm']: #, 'firstorder'
246 | mask = df.feature.str.contains(matrix)
247 | count_ = mask.sum()
248 | df.loc[mask, 'Matrix'] = matrix
249 | else:
250 | df['Feature'] = 'Other'
251 | for feat_type in ['Tumor_Other', 'Tumor_Lymphocyte', 'Stroma_Other', 'Stroma_Lymphocyte', 'Tumor', 'Necrosis', 'Stroma', 'Fat']:
252 | mask = (df.feature.str.contains(feat_type) | df.feature.str.contains(feat_type.lower())) & (df['Feature'] == 'Other')
253 | count_ = mask.sum()
254 | # df.loc[mask, 'Matrix'] = '{} (n={})'.format(matrix, count_)
255 | df.loc[mask, 'Feature'] = feat_type.replace('_Other', ' Nuclei').replace('_Lymphocyte', ' Lymphocyte')
256 | df['abbreviated_feature'] = df['feature'].str.replace('_Other_', ' Nuclei ' ).str.replace('_Lymphocyte_', ' Lymphocyte ')
257 |
258 | df = df.sort_values(by='feature')
259 | return df
260 |
261 | def make_volcano_plot(df_, method, output_plot_path, modality, top_k_to_label=1):
262 | x_axis_name = _get_x_axis_name_volcano(method)
263 |
264 | df = _make_results_df_pretty_for_plotting(df_, x_axis_name, modality)
265 | df.loc[df[x_axis_name] < -3, x_axis_name] = -3
266 | df.loc[df[x_axis_name] > 3, x_axis_name] = 3
267 |
268 | if modality == 'radiology':
269 | hue_col = 'Matrix'
270 | else:
271 | hue_col = 'Feature'
272 |
273 | if modality == 'radiology':
274 | df = df.sort_values(by='-log(p)', ascending=True)
275 |
276 | # plt.rcParams["axes.labelsize"] = 13
277 | fig = plt.figure(figsize=(3, 2), constrained_layout=True)
278 | g = sns.scatterplot(data=df,
279 | x=x_axis_name,
280 | y='-log(p)',
281 | hue=hue_col,
282 | alpha=0.7,
283 | palette='dark',
284 | hue_order=['glcm', 'gldm', 'glrlm', 'glszm', 'ngtdm'] if modality == 'radiology' else None,
285 | s=6)
286 | plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., title=hue_col)
287 |
288 | df = df.sort_values(by='-log(p)', ascending=False)
289 | ax = plt.gca()
290 | if modality == 'radiology':
291 | sig_threshold = get_ci_95_pval(df)
292 | if sig_threshold != -1:
293 | plt.axhline(y=sig_threshold, color='.2', linewidth=0.5, linestyle='-.')
294 | else:
295 | plt.gca().set_ylim((-0.20185706772068027, 4.2681049500679045))
296 | plt.gca().spines['right'].set_visible(False)
297 | plt.gca().spines['top'].set_visible(False)
298 | if '.svg' in output_plot_path:
299 | plt.savefig(output_plot_path)
300 | else:
301 | plt.savefig(output_plot_path, dpi=300)
302 | plt.close()
303 |
304 | def get_ci_95_pval(df):
305 | p_vals = df['-log(p)'].apply(lambda x: float(10)**(-x)).tolist()
306 | reject, pvals_corrected, _, _ = multipletests(pvals=p_vals, alpha=0.05, method='fdr_bh', is_sorted=True)
307 | if reject.sum() == 0:
308 | return -1
309 | arg_ = np.argmax(pvals_corrected>=0.05)
310 | return -np.log10((p_vals[arg_ - 1] + p_vals[arg_])/2)
311 |
312 |
313 | if __name__ == '__main__':
314 | PARSER = ArgumentParser(description='select features for survival analysis')
315 | PARSER.add_argument('feature_df_path', type=str,
316 | help='path to pandas DataFrame with features and outcomes. all columns in the DF will be evaluated except "duration", "observed," and any covariates')
317 | PARSER.add_argument('--outcome_df_path', type=str, help='path to pd df with outcomes, optional', default=None)
318 | PARSER.add_argument('--train_id_df_path', type=str, help='path to pd df with train IDs, optional', default=None)
319 |
320 | PARSER.add_argument('--output_df_path', type=str, default='feature_evaluation.csv', help='path at which to save feature evaluation df')
321 | PARSER.add_argument('--output_plot_path', type=str, default='feature_evaluation.png', help='path at which to save feature evaluation volcano plot')
322 | PARSER.add_argument('--index_col', type=str, default='Patient ID', help="name of column to set as index")
323 | PARSER.add_argument('--outlier_std_threshold', type=float, default=5, help="number of standard deviations to use for clipping")
324 | PARSER.add_argument('--method', type=str, default='cph',
325 | help="'kendall' for Kendall's Tau, 'cph' for Cox Proportional Hazards, 'c-index' for Concordance, 'spearman,' 'response-agnostic', or 'pearson'")
326 | PARSER.add_argument('--xy_covar', type=str, default=None, help="XY covariate column name")
327 | PARSER.add_argument('--x_covar', type=str, default=None, help="X covariate column name")
328 | PARSER.add_argument('--y_covar', type=str, default=None, help="Y covariate column name")
329 | PARSER.add_argument('--n_jobs', type=int, default=-1, help="number of parallel jobs to use")
330 | PARSER.add_argument('--modality', type=str, default='radiology', help="radiology or pathology")
331 | ARGS = PARSER.parse_args()
332 |
333 | with open('../global_config.yaml', 'r') as f:
334 | CONFIGS = yaml.safe_load(f)
335 | DATA_DIR = CONFIGS['data_dir']
336 | CODE_DIR = CONFIGS['code_dir']
337 |
338 | DF = pd.read_csv(os.path.join(DATA_DIR, ARGS.feature_df_path)).set_index(ARGS.index_col)
339 |
340 | if ARGS.modality == 'radiology':
341 | DF = DF[[x for x in DF.columns if 'firstorder' not in x]]
342 |
343 | if ARGS.outcome_df_path:
344 | OUTCOME_DF = pd.read_csv(os.path.join(DATA_DIR, ARGS.outcome_df_path)).set_index(ARGS.index_col)[['duration.OS', 'observed.OS']]
345 | OUTCOME_DF = OUTCOME_DF.rename(columns={'duration.OS': 'duration', 'observed.OS': 'observed'})
346 | DF = DF.join(OUTCOME_DF, how='inner')
347 |
348 | if ARGS.train_id_df_path:
349 | DF = DF[DF.index.isin(pd.read_csv(os.path.join(DATA_DIR, ARGS.train_id_df_path))[ARGS.index_col])]
350 |
351 | try:
352 | assert not DF.isna().sum().any()
353 | except AssertionError:
354 | raise RuntimeError("Input dataframe must not contain any NaN values.")
355 |
356 | FEATURE_NAMES = _get_features_to_consider(DF.columns.tolist(), ARGS)
357 |
358 | DF, _ = preprocess_features(DF,
359 | outlier_threshold=ARGS.outlier_std_threshold,
360 | feature_names=FEATURE_NAMES)
361 |
362 | RESULTS = evaluate_features(df=DF,
363 | feats_to_consider=FEATURE_NAMES,
364 | method=ARGS.method,
365 | covar=ARGS.xy_covar,
366 | x_covar=ARGS.x_covar,
367 | y_covar=ARGS.y_covar,
368 | n_jobs=ARGS.n_jobs)
369 | RESULTS.to_csv(ARGS.output_df_path)
370 | make_volcano_plot(RESULTS, ARGS.method, ARGS.output_plot_path, ARGS.modality)
371 |
--------------------------------------------------------------------------------
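To see what `evaluate_feature_cox` reports for a single feature, here is a self-contained toy run. The data are synthetic and for illustration only; column names follow the script's conventions:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
feat = rng.normal(size=n)
# Higher feature values shorten survival -> positive log(hazard ratio).
duration = rng.exponential(scale=24 * np.exp(-0.5 * feat))
observed = (rng.uniform(size=n) < 0.8).astype(float)  # ~20% censored
df = pd.DataFrame({'duration': duration, 'observed': observed, 'feat': feat})

model = CoxPHFitter(penalizer=0.0)
model.fit(df[['duration', 'observed', 'feat']],
          duration_col='duration', event_col='observed')
print(model.summary.coef['feat'], model.summary.p['feat'])  # log(HR) and p-value
```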
/feature-selection/select_features.sh:
--------------------------------------------------------------------------------
1 | python select_features.py code/ct-feature-extraction/features/ct_features_omentum.csv --outcome_df_path data/dataframes/clin_df.csv --output_df_path results/hr_ct_features_omentum.csv --output_plot_path results/hr_ct_features_omentum.png --modality radiology --method cph --train_id_df_path data/dataframes/train_ids.csv
2 |
3 | python select_features.py code/ct-feature-extraction/features/ct_features_ovary.csv --outcome_df_path data/dataframes/clin_df.csv --output_df_path results/hr_ct_features_ovary.csv --output_plot_path results/hr_ct_features_ovary.png --modality radiology --method cph --train_id_df_path data/dataframes/train_ids.csv
4 |
5 | python select_features.py code/hne-feature-extraction/tissue_tile_features/reference_hne_features.csv --outcome_df_path data/dataframes/clin_df.csv --output_df_path results/hr_hne_features.csv --output_plot_path results/hr_hne_features.png --modality pathology --method cph --xy_covar n_foreground_tiles --train_id_df_path data/dataframes/train_ids.csv
6 |
7 |
--------------------------------------------------------------------------------
/global_config.yaml:
--------------------------------------------------------------------------------
1 | data_dir:
2 | code_dir:
3 |
--------------------------------------------------------------------------------
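Both keys are consumed with the same pattern throughout the pipeline (compare `make_windowed_vols.py` and `select_features.py` above); each script opens the file via a relative path, so it must be launched from its own subdirectory of `code`:

```python
import os
import yaml

# Loading pattern shared by the pipeline scripts.
with open('../global_config.yaml', 'r') as f:
    configs = yaml.safe_load(f)

DATA_DIR = configs['data_dir']  # directory enclosing the Synapse `data` folder
CODE_DIR = configs['code_dir']  # directory enclosing this repository's `code` folder
print(os.path.join(DATA_DIR, 'data/dataframes/ct_df.csv'))
```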
/hne-feature-extraction/0_get_cohort_csv.sh:
--------------------------------------------------------------------------------
1 | python connector.py > ../data/dataframes/hne_df.csv
2 |
--------------------------------------------------------------------------------
/hne-feature-extraction/1_infer_tissue_types_and_extract_features.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | source /gpfs/mskmind_ess/limr/mambaforge/etc/profile.d/conda.sh
3 | conda activate transformer
4 |
5 | ARGS="--magnification 20
6 | --cohort_csv_path data/dataframes/hne_df.csv
7 | --preprocessed_cohort_csv_path data/dataframes/preprocessed_hne_df.csv
8 | --tile_dir ../tissue-type-training/pretilings
9 | --tile_selection_type otsu
10 | --otsu_threshold 0.5
11 | --purple_threshold 0.05
12 | --batch_size 200
13 | --min_n_tiles 100
14 | --normalize
15 | --gpu 0 1 2 3
16 | --tile_size 128
17 | --model resnet18
18 | --checkpoint_path ../tissue-type-training/checkpoints/tissue_type_classifier_weights.torch"
19 |
20 | python ../tissue-type-training/preprocess.py ${ARGS}
21 | python ../tissue-type-training/pretile.py ${ARGS}
22 | python infer_tissue_tile_clf.py ${ARGS}
23 | python map_inference_to_bitmap.py ${ARGS}
24 | python extract_feats_from_bitmaps.py ${ARGS}
25 |
--------------------------------------------------------------------------------
/hne-feature-extraction/1_infer_tissue_types_and_extract_features.sub:
--------------------------------------------------------------------------------
1 | universe = vanilla
2 | executable = 1_infer_tissue_types_and_extract_features.sh
3 |
4 | # requirements to specify the execution machine needs.
5 | #requirements = (CUDACapability >= 4)
6 |
7 | # "short", "medium", "long" for jobs lasting
8 | # ~12 hr, ~24 hr, ~7 days
9 | +GPUJobLength = "short"
10 |
11 | request_gpus = 4
12 | request_memory = 4GB
13 | request_cpus = 64
14 | #request_disk = 10MB
15 |
16 | output = $(Cluster)_$(Process).out
17 | log = $(Cluster)_$(Process).log
18 | error = $(Cluster)_$(Process).err
19 |
20 | # number of jobs to submit
21 | queue 1
22 |
23 |
--------------------------------------------------------------------------------
/hne-feature-extraction/2_extract_objects.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | source /gpfs/mskmind_ess/limr/mambaforge/etc/profile.d/conda.sh
3 | conda activate transformer
4 | for entry in "qupath/data/slides"/*.svs; do
5 | temp=`basename ${entry}`
6 | temp="${temp/.svs/.tsv}"
7 | temp="qupath/data/results/${temp}"
8 | echo ${temp}
9 | if test -f ${temp}; then
10 | echo "${temp} exists"
11 | else
12 | echo sh _run_gpu_qupath_stardist_singleSlide.sh `basename $entry`
13 | sh _run_gpu_qupath_stardist_singleSlide.sh `basename $entry`
14 | fi
15 | done
16 |
--------------------------------------------------------------------------------
/hne-feature-extraction/2_extract_objects.sub:
--------------------------------------------------------------------------------
1 | universe = vanilla
2 | executable = _run_gpu_qupath_stardist_singleSlide.sh
3 | arguments = $(filename)
4 |
5 | # requirements to specify the execution machine needs.
6 | #requirements = (CUDACapability >= 4)
7 |
8 | # "short", "medium", "long" for jobs lasting
9 | # ~12 hr, ~24 hr, ~7 days
10 | +GPUJobLength = "short"
11 |
12 | request_gpus = 1
13 | request_memory = 36GB
14 | request_cpus = 10
15 | #request_disk = 10MB
16 |
17 | output = $(Cluster)_$(Process).out
18 | log = $(Cluster)_$(Process).log
19 | error = $(Cluster)_$(Process).err
20 |
21 | # number of jobs to submit
22 | queue filename from cut -d, -f2 ../data/dataframes/preprocessed_hne_df.csv |
23 |
--------------------------------------------------------------------------------
/hne-feature-extraction/3_label_objects_and_extract_features.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | source /gpfs/mskmind_ess/limr/mambaforge/etc/profile.d/conda.sh
4 | conda activate transformer
5 | ARGS="
6 | --preprocessed_cohort_csv_path data/dataframes/preprocessed_hne_df.csv
7 | --checkpoint_path ../tissue-type-training/checkpoints/tissue_type_classifier_weights.torch"
8 |
9 | python merge_cells_and_regions.py ${ARGS}
10 | python extract_feats_from_object_detections.py ${ARGS}
11 |
--------------------------------------------------------------------------------
/hne-feature-extraction/3_label_objects_and_extract_features.sub:
--------------------------------------------------------------------------------
1 | universe = vanilla
2 | executable = 3_label_objects_and_extract_features.sh
3 |
4 | # requirements to specify the execution machine needs.
5 | #requirements = (CUDACapability >= 4)
6 |
7 | # "short", "medium", "long" for jobs lasting
8 | # ~12 hr, ~24 hr, ~7 days
9 | +GPUJobLength = "short"
10 |
11 | #request_gpus = 0
12 | request_memory = 384GB
13 | request_cpus = 32
14 | #request_disk = 10MB
15 |
16 | output = $(Cluster)_$(Process).out
17 | log = $(Cluster)_$(Process).log
18 | error = $(Cluster)_$(Process).err
19 |
20 | # number of jobs to submit
21 | queue 1
22 |
23 |
--------------------------------------------------------------------------------
/hne-feature-extraction/_run_gpu_qupath_stardist_singleSlide.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | source /gpfs/mskmind_ess/limr/mambaforge/etc/profile.d/conda.sh
3 | conda activate transformer
4 |
5 | singularity run --env TF_FORCE_GPU_ALLOW_GROWTH=true,PER_PROCESS_GPU_MEMORY_FRACTION=0.8 -B $(dirname $1):/data/slides,qupath/data/results:/data/results,qupath/models:/models,qupath/scripts:/scripts --nv qupath/qupath-stardist_latest.sif java -Djava.awt.headless=true \
6 | -Djava.library.path=/qupath-gpu/build/dist/QuPath-0.2.3/lib/app \
7 | -jar /qupath-gpu/build/dist/QuPath-0.2.3/lib/app/qupath-0.2.3.jar \
8 | script --image /data/slides/$(basename $1) /scripts/stardist_nuclei_and_lymphocytes.groovy
9 |
--------------------------------------------------------------------------------
/hne-feature-extraction/bitmaps/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/bitmaps/.gitkeep
--------------------------------------------------------------------------------
/hne-feature-extraction/connector.py:
--------------------------------------------------------------------------------
1 | """connector.py.
2 |
3 | Interface for reading and manipulating tables from Dremio.
4 |
5 | Test with -
6 | $ python connector.py
7 | """
8 | import urllib.parse
9 | from pathlib import Path
10 | from typing import Any, Callable, Optional
11 | import sys
12 |
13 | import pandas as pd
14 | from pyarrow import flight
15 |
16 |
17 | class TableLoader:
18 | """Generic table loading interface."""
19 |
20 | def __init__(self, user: str, password: str, flight_port: int = 32010):
21 | self.mod_dict: dict[str, Any] = {}
22 | self.mod_mask = pd.DataFrame()
23 | self.user = user
24 | self.password = password
25 | self.flight_port = flight_port
26 |
27 | def load_from_dremio(self, url: Any) -> pd.DataFrame:
28 | """Load table from Dremio."""
29 | dremio_session = DremioDataframeConnector(
30 | scheme="grpc+tcp",
31 | hostname=url.hostname,
32 | flightport=self.flight_port,
33 | dremio_user=self.user,
34 | dremio_password=self.password,
35 | connection_args={},
36 | )
37 | return dremio_session.get_table(url.query)
38 |
39 |
40 | class FeatureTableLoader(TableLoader):
41 | """Trying to make loading from dremio nicer."""
42 |
43 | def add_table(
44 | self,
45 | table: str,
46 | feature_tag: str,
47 | index_cols: list[str] = ["main_index"],
48 | fx_transform: Optional[Callable[[Any], Any]] = None,
49 | check_uniqueness: bool = True,
50 | ) -> pd.DataFrame:
51 | """Add table to FeatureTableLoader object."""
52 | if not isinstance(table, pd.DataFrame):
53 | path = Path(table)
54 | url = urllib.parse.urlparse(table)
55 |
56 | print(f"Info: {url}")
57 |
58 | if url.scheme == "file" and path.suffix == ".csv":
59 | df = pd.read_csv(path)
60 | if "dremio" in url.scheme:
61 | df = self.load_from_dremio(url)
62 | else:
63 | df = table.reset_index()
64 |
65 | df = df.set_index(index_cols)
66 |
67 | if fx_transform is not None:
68 | df = fx_transform(df)
69 |
70 | if not len(df.index.unique()) == len(df.index) and check_uniqueness:
71 | raise RuntimeError(
72 | "Feature tables must have unique indicies post-transform",
73 | f"N={len(df.index)-len(df.index.unique())} try something else!",
74 | )
75 |
76 | self.mod_dict[feature_tag] = df
77 |
78 | return df
79 |
80 | def calculate_mask(self) -> pd.DataFrame:
81 | """Calculate index mask."""
82 | for key in self.mod_dict.keys():
83 | df = self[key]
84 | main_index = df.index.get_level_values(0)
85 |
86 | self.mod_mask = self.mod_mask.reindex(self.mod_mask.index.union(main_index))
87 | self.mod_mask.loc[main_index, key] = True
88 | self.mod_mask = self.mod_mask.fillna(False)
89 |
90 | return self.mod_mask
91 |
92 | def __getitem__(self, key: str) -> Any:
93 | """Get item."""
94 | return self.mod_dict[key]
95 |
96 | def __setitem__(self, key: str, item: Any) -> None:
97 | """Set item."""
98 | self.mod_dict[key] = item
99 |
100 |
101 | class DremioClientAuthMiddlewareFactory(flight.ClientMiddlewareFactory): # type: ignore
102 | """A factory that creates DremioClientAuthMiddleware(s)."""
103 |
104 | def __init__(self) -> None:
105 | self.call_credential: list[Any] = []
106 |
107 | def start_call(self, info: Any) -> Any:  # type: ignore
108 | """Start call."""
109 | return DremioClientAuthMiddleware(self)
110 |
111 | def set_call_credential(self, call_credential: list[Any]) -> None:
112 | """Set call credentials."""
113 | self.call_credential = call_credential
114 |
115 |
116 | class DremioClientAuthMiddleware(flight.ClientMiddleware): # type: ignore
117 | """Dremio ClientMiddleware used for authentication.
118 |
119 | Extracts the bearer token from
120 | the authorization header returned by the Dremio
121 | Flight Server Endpoint.
122 |
123 | Parameters
124 | ----------
125 | factory : ClientHeaderAuthMiddlewareFactory
126 | The factory to set call credentials if an
127 | authorization header with bearer token is
128 | returned by the Dremio server.
129 | """
130 |
131 | def __init__(self, factory: DremioClientAuthMiddlewareFactory):
132 | self.factory = factory
133 |
134 | def received_headers(self, headers: dict[str, Any]) -> None:
135 | """Process header."""
136 | auth_header_key = "authorization"
137 | authorization_header: list[Any] = []
138 | for key in headers:
139 | if key.lower() == auth_header_key:
140 | authorization_header = headers.get(auth_header_key) # type: ignore
141 | self.factory.set_call_credential(
142 | [b"authorization", authorization_header[0].encode("utf-8")]
143 | )
144 |
145 |
146 | class DremioDataframeConnector:
147 | """Dremio connector.
148 |
149 | Interfaces with a Dremio instance/cluster
150 | via Apache Arrow Flight for fast read performance.
151 |
152 | Parameters
153 | ----------
154 | scheme: connection scheme
155 | hostname: host of main dremio name
156 | flightport: which port dremio exposes to flight requests
157 | dremio_user: username to use
158 | dremio_password: associated password
159 | connection_args: anything else to pass to the FlightClient initialization
160 | """
161 |
162 | def __init__(
163 | self,
164 | scheme: str,
165 | hostname: str,
166 | flightport: int,
167 | dremio_user: str,
168 | dremio_password: str,
169 | connection_args: dict[str, Any],
170 | ):
171 | # Skipping tls...
172 |
173 | # Two WLM settings can be provided upon initial authentication
174 | # with the Dremio Server Flight Endpoint:
175 | # - routing-tag
176 | # - routing queue
177 | initial_options = flight.FlightCallOptions(
178 | headers=[
179 | (b"routing-tag", b"test-routing-tag"),
180 | (b"routing-queue", b"Low Cost User Queries"),
181 | ]
182 | )
183 | client_auth_middleware = DremioClientAuthMiddlewareFactory()
184 | client = flight.FlightClient(
185 | f"{scheme}://{hostname}:{flightport}",
186 | middleware=[client_auth_middleware],
187 | **connection_args,
188 | )
189 | self.bearer_token = client.authenticate_basic_token(
190 | dremio_user, dremio_password, initial_options
191 | )
192 | self.client = client
193 |
194 | def run(self, project: str, table_name: str) -> pd.DataFrame:
195 | """Get a fixed table.
196 |
197 | Returns the virtual table at project(or "space").table_name
198 | as a pandas dataframe
199 |
200 | Parameters
201 | ----------
202 | project: Project ID to read from
203 | table_name: Table name to load
204 |
205 | """
206 | sqlquery = f'''SELECT * FROM "{project}"."{table_name}"'''
207 |
208 | # flight_desc = flight.FlightDescriptor.for_command(sqlquery)
209 | print("[INFO] Query: ", sqlquery)
210 |
211 | options = flight.FlightCallOptions(headers=[self.bearer_token])
212 | # schema = self.client.get_schema(flight_desc, options)
213 |
214 | # Get the FlightInfo message to retrieve the Ticket corresponding
215 | # to the query result set.
216 | flight_info = self.client.get_flight_info(
217 | flight.FlightDescriptor.for_command(sqlquery), options
218 | )
219 |
220 | # Retrieve the result set as a stream of Arrow record batches.
221 | reader = self.client.do_get(flight_info.endpoints[0].ticket, options)
222 | return reader.read_pandas()
223 |
224 | def get_table(self, sqlquery: str) -> pd.DataFrame:
225 | """Run a query.
226 |
227 | Returns the virtual table at project(or "space").table_name
228 | as a pandas dataframe.
229 |
230 | Parameters
231 | ----------
232 | project: Project ID to read from
233 | table_name: Table name to load
234 |
235 | """
236 | # flight_desc = flight.FlightDescriptor.for_command(sqlquery)
237 | print("[INFO] Query: ", sqlquery)
238 |
239 | options = flight.FlightCallOptions(headers=[self.bearer_token])
240 | # schema = self.client.get_schema(flight_desc, options)
241 |
242 | # Get the FlightInfo message to retrieve the Ticket corresponding
243 | # to the query result set.
244 | flight_info = self.client.get_flight_info(
245 | flight.FlightDescriptor.for_command(sqlquery), options
246 | )
247 |
248 | # Retrieve the result set as a stream of Arrow record batches.
249 | reader = self.client.do_get(flight_info.endpoints[0].ticket, options)
250 |
251 | return reader.read_pandas()
252 |
253 |
254 | if __name__ == "__main__":
255 | import getpass
256 |
257 | # set username and password
258 | # (or Personal Access Token) when prompted at the command prompt
259 | DREMIO_USER = input("Username: ")
260 | DREMIO_PASSWORD = getpass.getpass(prompt="Password or PAT: ", stream=None)
261 |
262 | dremio_session = DremioDataframeConnector(
263 | scheme="grpc+tcp",
264 | hostname="tlvidreamcord1",
265 | flightport=32010,
266 | dremio_user=DREMIO_USER,
267 | dremio_password=DREMIO_PASSWORD,
268 | connection_args={},
269 | )
270 | query = 'SELECT merged_hne_inventory.spectrum_sample_id, merged_hne_inventory.slide_image FROM merged_hne_inventory'
271 |
272 | df = dremio_session.get_table(query)
273 | df['merged_hne_inventory.slide_image'] = df['merged_hne_inventory.slide_image'].str.removeprefix("file://")
274 | df.to_csv(sys.stdout)
275 |
--------------------------------------------------------------------------------
/hne-feature-extraction/environment.yml:
--------------------------------------------------------------------------------
1 | ../tissue-type-training/environment.yml
--------------------------------------------------------------------------------
/hne-feature-extraction/extract_feats_from_bitmaps.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import os
4 | from joblib import Parallel, delayed
5 | from PIL import Image
6 | from skimage import measure
7 | import sys
8 | sys.path.append('../tissue-type-training')
9 | import config
10 |
11 |
12 | def extract_fraction_necrosis(result_dict, class_bitmaps):
13 | necrotic_pixels = np.sum(class_bitmaps['Necrosis'])
14 | total_foreground_pixels = 0
15 | for class_, bitmap in class_bitmaps.items():
16 | total_foreground_pixels += np.sum(bitmap)
17 | feature = float(necrotic_pixels) / total_foreground_pixels
18 | result_dict['fraction_area_necrotic'] = feature
19 |
20 |
21 | def extract_ratio_necrosis_to_tumor(result_dict, class_bitmaps):
22 | necrotic_pixels = np.sum(class_bitmaps['Necrosis'])
23 | tumor_pixels = np.sum(class_bitmaps['Tumor'])
24 | if tumor_pixels > 0:
25 | feature = float(necrotic_pixels) / tumor_pixels
26 | result_dict['ratio_necrosis_to_tumor'] = feature
27 | else:
28 | pass
29 |
30 |
31 | def extract_ratio_necrosis_to_stroma(result_dict, class_bitmaps):
32 | necrotic_pixels = np.sum(class_bitmaps['Necrosis'])
33 | stroma_pixels = np.sum(class_bitmaps['Stroma'])
34 | if stroma_pixels > 0:
35 | feature = float(necrotic_pixels) / stroma_pixels
36 | result_dict['ratio_necrosis_to_stroma'] = feature
37 | else:
38 | pass
39 |
40 |
41 | def extract_shannon_entropy(result_dict, class_bitmaps, prefix=None):
42 | n = 0
43 | p = []
44 | for bitmap in class_bitmaps.values():
45 | count = np.sum(bitmap != 0)
46 | n += count
47 | p.append(count)
48 | p = np.array(p)
49 | p = p/n
50 |
51 | shannon_entropy = 0
52 | for prob in p:
53 | if prob > 0:
54 | shannon_entropy -= prob * np.log2(prob)
55 |
56 | if prefix:
57 | key_name = '{}_shannon_entropy'.format(prefix)
58 | else:
59 | key_name = 'shannon_entropy'
60 |
61 | result_dict[key_name] = shannon_entropy
62 |
63 |
64 | def extract_tumor_stroma_entropy(result_dict, class_bitmaps_):
65 | class_bitmaps = {'Tumor': class_bitmaps_['Tumor'],
66 | 'Stroma': class_bitmaps_['Stroma']}
67 | return extract_shannon_entropy(result_dict, class_bitmaps, prefix='Tumor_Stroma')
68 |
69 |
70 | def get_classwise_regionprops(result_dict, class_bitmaps):
71 | for class_, bitmap in class_bitmaps.items():
72 | features = _get_single_class_regionprops(bitmap, class_)
73 | result_dict.update(features)
74 |
75 | # get regionprops for largest connected component
76 | largest_cc_map = _get_single_class_largest_cc_bitmap(bitmap)
77 | if largest_cc_map is not None:
78 | features = _get_single_class_regionprops(largest_cc_map, '_'.join([class_,
79 | 'largest_component']))
80 | result_dict.update(features)
81 |
82 |
83 | def _get_single_class_regionprops(class_bitmap, class_label):
84 | features = {}
85 | properties = measure.regionprops(class_bitmap)
86 | if properties:
87 | properties = properties[0]
88 | else:
89 | return {'_'.join([class_label, 'area']): 0}
90 |
91 | features['_'.join([class_label, 'area'])] = properties.area
92 | features['_'.join([class_label, 'convex_area'])] = properties.convex_area
93 | features['_'.join([class_label, 'eccentricity'])] = properties.eccentricity
94 | features['_'.join([class_label, 'equivalent_diameter'])] = properties.equivalent_diameter
95 | features['_'.join([class_label, 'euler_number'])] = properties.euler_number
96 | features['_'.join([class_label, 'extent'])] = properties.extent
97 | # features['_'.join([class_label, 'feret_diameter_max'])] = properties.feret_diameter_max
98 | # features['_'.join([class_label, 'filled_area'])] = properties.filled_area
99 | features['_'.join([class_label, 'major_axis_length'])] = properties.major_axis_length
100 | features['_'.join([class_label, 'minor_axis_length'])] = properties.minor_axis_length
101 | features['_'.join([class_label, 'perimeter'])] = properties.perimeter
102 | # features['_'.join([class_label, 'perimeter_crofton'])] = properties.perimeter_crofton
103 | features['_'.join([class_label, 'solidity'])] = properties.solidity
104 | features['_'.join([class_label, 'PA_ratio'])] = properties.perimeter / float(properties.area)
105 | return features
108 |
109 |
110 | def _get_single_class_largest_cc_bitmap(bitmap):
111 | labels, n = measure.label(bitmap, return_num=True)
112 |
113 | largest_area = 0
114 | associated_label = -1
115 |     for label in range(1, n + 1):  # measure.label assigns labels 1..n
116 | area = np.sum(labels == label)
117 | if area > largest_area:
118 | largest_area = area
119 | associated_label = label
120 | if associated_label == -1:
121 | return None
122 | else:
123 |         return (labels == associated_label).astype(int)
124 |
125 |
126 | def extract_feats(dir_, slide_name, class_list):
127 | class_bitmaps = dict()
128 | for class_ in class_list:
129 | class_bitmaps[class_] = np.array(Image.open(os.path.join(dir_, slide_name, class_ + '.png'))).squeeze()
130 |
131 | result_ = dict()
132 | get_classwise_regionprops(result_, class_bitmaps)
133 | extract_fraction_necrosis(result_, class_bitmaps)
134 | extract_ratio_necrosis_to_tumor(result_, class_bitmaps)
135 | extract_ratio_necrosis_to_stroma(result_, class_bitmaps)
136 | extract_shannon_entropy(result_, class_bitmaps)
137 | extract_tumor_stroma_entropy(result_, class_bitmaps)
138 | return {slide_name: result_}
139 |
140 |
141 | if __name__ == '__main__':
142 | checkpoint_name = config.args.checkpoint_path.split('/')[-1].replace('.torch', '')
143 | bitmap_dir = 'bitmaps/{}'.format(checkpoint_name)
144 | feat_df_filename = 'tissue_tile_features/{}.csv'.format(checkpoint_name)
145 | SERIAL = False
146 |
147 | slide_list = os.listdir(bitmap_dir)
148 |
149 | map_key = {'Stroma': 0,
150 | 'Tumor': 1,
151 | 'Fat': 2,
152 | 'Necrosis': 3}
153 | classes = list(map_key.keys())
154 |
155 | results = {}
156 | if SERIAL:
157 | for slide in slide_list:
158 | print(slide)
159 | result = extract_feats(bitmap_dir, slide, classes)
160 | results.update(result)
161 | else:
162 | dicts = Parallel(n_jobs=32)(delayed(extract_feats)(bitmap_dir, slide, classes) for slide in slide_list)
163 | for dict_ in dicts:
164 | results.update(dict_)
165 |
166 | df = pd.DataFrame(results).T
167 | print(df)
168 |
169 | df = df.reset_index().rename(columns={'index': 'image_id'})
170 | df.to_csv(feat_df_filename, index=False)
171 |
172 |
--------------------------------------------------------------------------------
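A minimal sketch of the class-balance entropy computed by `extract_shannon_entropy` above, run on toy bitmaps; the function name and inputs below are illustrative, not part of the repository:

```python
import numpy as np

def shannon_entropy_from_bitmaps(class_bitmaps):
    """Entropy (in bits) of the tissue-class distribution over foreground pixels."""
    counts = np.array([np.sum(b != 0) for b in class_bitmaps.values()], dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]  # 0 * log2(0) is treated as 0, as in the loop above
    return float(-(p * np.log2(p)).sum())

# toy 4x4 bitmaps: foreground is half Tumor, half Stroma -> entropy = 1 bit
bitmaps = {'Tumor': np.ones((4, 4)), 'Stroma': np.ones((4, 4)), 'Necrosis': np.zeros((4, 4))}
print(shannon_entropy_from_bitmaps(bitmaps))  # 1.0
```
--------------------------------------------------------------------------------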
/hne-feature-extraction/extract_feats_from_object_detections.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import os
4 | import yaml
5 | from joblib import Parallel, delayed
6 | import sys
7 | sys.path.append('../tissue-type-training')
8 | import config
9 |
10 |
11 | def get_density(obj_feats, regional_feats, result_dict, parent, class_):
12 |     parent_area = regional_feats['{}_area'.format(parent)].item() * 64  # one bitmap pixel at 1/16 downsampling and 0.5 µm/pixel covers (16 * 0.5)^2 = 64 µm^2
13 | # print(parent_area)
14 | obj_mask = (obj_feats.Parent == parent) & (obj_feats.Class == class_)
15 | # print(obj_mask)
16 | obj_count = obj_mask.sum()
17 | # print(obj_count)
18 | key_ = '{}_{}_density'.format(parent, class_)
19 | try:
20 | result_dict[key_] = float(obj_count) / parent_area
21 | except ZeroDivisionError:
22 | result_dict[key_] = np.nan
23 |
24 |
25 | def get_quantiles(object_feats, mask, feat, output_feat_name, results_dict):
26 | results_dict[output_feat_name.format('mean')] = object_feats.loc[mask, feat].mean()
27 | for quantile in np.arange(0.1, 1, 0.1):
28 | results_dict[output_feat_name.format('quantile{:2.1f}'.format(quantile))] = object_feats.loc[mask, feat].quantile(quantile)
29 | results_dict[output_feat_name.format('var')] = object_feats.loc[mask, feat].var()
30 | results_dict[output_feat_name.format('skew')] = object_feats.loc[mask, feat].skew()
31 | results_dict[output_feat_name.format('kurtosis')] = object_feats.loc[mask, feat].kurtosis()
32 |
33 |
34 | def extract_feats(object_feat_fn, regional_feature_df_, slide_id):
35 | regional_feature_df = regional_feature_df_[regional_feature_df_.index.astype(str) == str(slide_id)]
36 | if len(regional_feature_df) != 1:
37 | return {}
38 |
39 | result_ = {}
40 | object_feats = pd.read_csv(object_feat_fn)
41 | if len(object_feats.columns) == 1:
42 | object_feats = pd.read_csv(object_feat_fn, delimiter='\t')
43 | object_feats = object_feats[object_feats['Detection probability'] > DETECTION_PROB_THRESHOLD]
44 |     # cell densities per (parent tissue region, cell class) pair
45 |     for parent, class_ in [('Tumor', 'Lymphocyte'),
46 |                            ('Tumor', 'Other'),
47 |                            ('Necrosis', 'Other'),
48 |                            ('Stroma', 'Lymphocyte'),
49 |                            ('Stroma', 'Other')]:
50 |         get_density(object_feats, regional_feature_df, result_, parent=parent, class_=class_)
51 |
52 |     # distribution summaries of nuclear morphology and stain intensity for
53 |     # non-lymphocyte ("Other") cells within tumor regions
54 |     tumor_other_mask = (object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other')
55 |     quantile_feats = [
56 |         ('Nucleus: Area µm^2', 'Tumor_Other_{}_nuclear_area'),
57 |         ('Nucleus: Circularity', 'Tumor_Other_{}_nuclear_circularity'),
58 |         ('Nucleus: Solidity', 'Tumor_Other_{}_nuclear_solidity'),
59 |         ('Nucleus: Max diameter µm', 'Tumor_Other_{}_nuclear_max_diameter'),
60 |         ('Hematoxylin: Nucleus: Mean', 'Tumor_Other_{}_nuclear_hematoxylin_mean'),
61 |         ('Hematoxylin: Nucleus: Median', 'Tumor_Other_{}_nuclear_hematoxylin_median'),
62 |         ('Hematoxylin: Nucleus: Min', 'Tumor_Other_{}_nuclear_hematoxylin_min'),
63 |         ('Hematoxylin: Nucleus: Max', 'Tumor_Other_{}_nuclear_hematoxylin_max'),
64 |         ('Hematoxylin: Nucleus: Std.Dev.', 'Tumor_Other_{}_nuclear_hematoxylin_stdDev'),
65 |         ('Eosin: Nucleus: Mean', 'Tumor_Other_{}_nuclear_eosin_mean'),
66 |         ('Eosin: Nucleus: Median', 'Tumor_Other_{}_nuclear_eosin_median'),
67 |         ('Eosin: Nucleus: Min', 'Tumor_Other_{}_nuclear_eosin_min'),
68 |         ('Eosin: Nucleus: Max', 'Tumor_Other_{}_nuclear_eosin_max'),
69 |         ('Eosin: Nucleus: Std.Dev.', 'Tumor_Other_{}_nuclear_eosin_stdDev'),
70 |     ]
71 |     for feat, output_feat_name in quantile_feats:
72 |         get_quantiles(object_feats=object_feats,
73 |                       mask=tumor_other_mask,
74 |                       feat=feat,
75 |                       output_feat_name=output_feat_name,
76 |                       results_dict=result_)
158 |
159 | try:
160 | result_['ratio_Tumor_Lymphocyte_to_Tumor_Other'] = float(
161 | ((object_feats.Parent == 'Tumor') & (object_feats.Class == 'Lymphocyte')).sum()) / float(
162 | ((object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other')).sum()
163 | )
164 | except ZeroDivisionError:
165 | result_['ratio_Tumor_Lymphocyte_to_Tumor_Other'] = np.nan
166 |
167 | return {int(slide_id): result_}
168 |
169 |
170 | if __name__ == '__main__':
171 |
172 | checkpoint_name = config.args.checkpoint_path.split('/')[-1].replace('.torch', '')
173 |
174 | regional_feat_df_filename = 'tissue_tile_features/{}.csv'.format(checkpoint_name)
175 |
176 | object_detection_dir = 'final_objects/{}'.format(checkpoint_name)
177 |
178 | merged_feat_df_filename = 'tissue_tile_features/{}_merged.csv'.format(checkpoint_name)
179 | SERIAL = True
180 | DETECTION_PROB_THRESHOLD = 0.5
181 |
182 | slide_list = [x for x in os.listdir(object_detection_dir) if (('.csv' in x) or ('.tsv' in x))]
183 | regional_feat_df = pd.read_csv(regional_feat_df_filename).set_index('image_id')
184 |
185 | results = {}
186 | if SERIAL:
187 | for slide in slide_list:
188 | print(slide)
189 | result = extract_feats(os.path.join(object_detection_dir, slide), regional_feat_df, slide[:-4])
190 | results.update(result)
191 | else:
192 | dicts = Parallel(n_jobs=32)(delayed(extract_feats)(os.path.join(object_detection_dir, slide), regional_feat_df, slide[:-4]) for slide in slide_list)
193 | for dict_ in dicts:
194 | results.update(dict_)
195 |
196 | df = pd.DataFrame(results).T
197 | df.index = df.index.astype('str')
198 | df = df.join(regional_feat_df, how='inner')
199 | print(df)
200 |
201 | df = df.reset_index().rename(columns={'index': 'image_id'})
202 | df['image_id'] = df['image_id'].astype(str)
203 |
204 | hne_df = pd.read_csv(config.args.preprocessed_cohort_csv_path)
205 | hne_df['image_id'] = hne_df['image_path'].apply(lambda x: x.split('/')[-1][:-4]).astype(str)
206 | df = df.join(hne_df[['image_id', 'Patient ID', 'n_foreground_tiles']].set_index('image_id'), on='image_id', how='left')
207 | df = df.drop(columns=['image_id'])
208 | df['Patient ID'] = df['Patient ID'].astype(str).apply(lambda x: x.zfill(3))
209 | df = df.fillna(df.median())
210 | df.to_csv(merged_feat_df_filename, index=False)
211 |
212 |
--------------------------------------------------------------------------------
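The `get_quantiles` helper above summarizes one per-cell measurement as a mean, nine deciles, variance, skew, and kurtosis. A minimal sketch on made-up data (the toy values below are hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical per-cell measurements for one slide
object_feats = pd.DataFrame({
    'Parent': ['Tumor', 'Tumor', 'Tumor', 'Tumor', 'Stroma', 'Stroma'],
    'Class': ['Other', 'Other', 'Other', 'Other', 'Other', 'Lymphocyte'],
    'Nucleus: Area µm^2': [20.0, 25.0, 30.0, 60.0, 15.0, 18.0],
})
mask = (object_feats.Parent == 'Tumor') & (object_feats.Class == 'Other')
vals = object_feats.loc[mask, 'Nucleus: Area µm^2']

summary = {'Tumor_Other_mean_nuclear_area': vals.mean()}
for q in np.arange(0.1, 1, 0.1):  # deciles 0.1 .. 0.9
    summary['Tumor_Other_quantile{:2.1f}_nuclear_area'.format(q)] = vals.quantile(q)
summary['Tumor_Other_var_nuclear_area'] = vals.var()
summary['Tumor_Other_skew_nuclear_area'] = vals.skew()
summary['Tumor_Other_kurtosis_nuclear_area'] = vals.kurtosis()
print(summary['Tumor_Other_mean_nuclear_area'])  # 33.75
```
--------------------------------------------------------------------------------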
/hne-feature-extraction/final_objects/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/final_objects/.gitkeep
--------------------------------------------------------------------------------
/hne-feature-extraction/hne-feature-extraction.dag:
--------------------------------------------------------------------------------
1 | JOB FIRST 1_infer_tissue_types_and_extract_features.sub
2 | JOB SECOND 2_extract_objects.sub
3 | JOB THIRD 3_label_objects_and_extract_features.sub
4 | PARENT FIRST CHILD SECOND
5 | PARENT SECOND CHILD THIRD
6 |
--------------------------------------------------------------------------------
/hne-feature-extraction/infer_tissue_tile_clf.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import torch.nn as nn
3 | import numpy as np
4 | import torch
5 | import csv
6 | import os
7 | from torch.utils.data import DataLoader
8 | import torch.multiprocessing as mp
9 |
10 | import sys
11 | sys.path.append('../tissue-type-training')
12 | from general_utils import setup, get_val_transforms
13 | import config
14 | from models import load_tissue_tile_net
15 | from train_tissue_tile_clf import prep_df, make_preds
16 | from dataset import TissueTileDataset
17 |
18 |
19 | def make_preds_by_slide(model, df, device, file_dir, n_classes):
20 | header = ['tile_file_name', 'label']
21 | header.extend(['score_{}'.format(k) for k in range(n_classes)])
22 |
23 | for image_path, sub_df in df.groupby('image_path'):
24 | dataset = TissueTileDataset(df=sub_df,
25 | tile_dir=config.args.tile_dir,
26 | transforms=get_val_transforms())
27 | loader = DataLoader(dataset,
28 | batch_size=config.args.batch_size,
29 | num_workers=4)
30 |
31 | file_name = os.path.join(file_dir, image_path.split('/')[-1][:-4] + '.csv')
32 |
33 | with open(file_name, 'w', newline='') as file:
34 | writer = csv.writer(file, delimiter=',')
35 | writer.writerow(header)
36 |
37 | with torch.no_grad():
38 | for ids, tiles, labels in loader:
39 | preds = model(tiles.to(device))
40 | preds = preds.detach().cpu().tolist()
41 | for idx, label, pred_list in zip(ids, labels.tolist(), preds):
42 | row = [idx, label]
43 | row.extend(pred_list)
44 | writer.writerow(row)
45 |
46 |
47 | def get_fold_slides(df, world_size, rank):
48 | all_slides = df.image_path.unique()
49 | chunks = np.array_split(all_slides, world_size)
50 | return chunks[rank]
51 |
52 |
53 | def distribute(rank, world_size, df_, n_classes, val_dir):
54 | setup(rank, world_size)
55 | device_ids = [config.args.gpu[rank]]
56 | device = torch.device('cuda:{}'.format(device_ids[0]))
57 | print('distributed to device {}'.format(str(device)))
58 |
59 | df = df_[df_.image_path.isin(get_fold_slides(df_, world_size, rank))]
60 |
61 | model = load_tissue_tile_net(config.args.checkpoint_path, activation=nn.Softmax(dim=1), n_classes=n_classes)
62 | model.to(device)
63 | model.eval()
64 |
65 | make_preds_by_slide(model,
66 | df,
67 | device,
68 | val_dir,
69 | n_classes)
70 |
71 |
72 | def serialize(df, n_classes, val_dir):
73 |     model = load_tissue_tile_net(config.args.checkpoint_path, activation=nn.Softmax(dim=1), n_classes=n_classes)  # dim=1 matches the distributed path
74 |
75 | device = torch.device('cuda:{}'.format(config.args.gpu[0]))
76 | model.to(device)
77 | model.eval()
78 |
79 | make_preds_by_slide(model,
80 | df,
81 | device,
82 | val_dir,
83 | n_classes)
84 |
85 |
86 | if __name__ == '__main__':
87 | assert config.args.checkpoint_path
88 |
89 | checkpoint_name = config.args.checkpoint_path.split('/')[-1].replace('.torch', '')
90 | inference_dir = 'inference/{}'.format(checkpoint_name)
91 | if not os.path.exists(inference_dir):
92 | os.mkdir(inference_dir)
93 |
94 | world_size_ = len(config.args.gpu)
95 | df, n_classes, _, _ = prep_df(config.args.preprocessed_cohort_csv_path, tile_dir=config.args.tile_dir, map_classes=False)
96 |
97 | if world_size_ == 1:
98 | serialize(df, n_classes, inference_dir)
99 | else:
100 | mp.spawn(distribute,
101 | args=(world_size_, df, n_classes, inference_dir),
102 | nprocs=world_size_,
103 | join=True)
104 |
--------------------------------------------------------------------------------
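In `infer_tissue_tile_clf.py` above, `get_fold_slides` shards whole slides across GPU workers with `np.array_split`, so each rank processes a contiguous chunk of the slide list. A standalone sketch (the slide names are hypothetical):

```python
import numpy as np

all_slides = np.array(['s1', 's2', 's3', 's4', 's5', 's6', 's7'])
world_size = 3  # e.g. one process per GPU
chunks = np.array_split(all_slides, world_size)  # chunk sizes 3, 2, 2
for rank, chunk in enumerate(chunks):
    print(rank, list(chunk))
# 0 ['s1', 's2', 's3']
# 1 ['s4', 's5']
# 2 ['s6', 's7']
```
--------------------------------------------------------------------------------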
/hne-feature-extraction/inference/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/inference/.gitkeep
--------------------------------------------------------------------------------
/hne-feature-extraction/map_inference_to_bitmap.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import os
4 | import yaml
5 | from joblib import Parallel, delayed
6 | from openslide import OpenSlide
7 | from openslide.lowlevel import OpenSlideUnsupportedFormatError
8 | from skimage.draw import rectangle_perimeter, rectangle
9 | from PIL import Image
10 | import sys
11 | sys.path.append('../tissue-type-training')
12 | import general_utils
13 | import config
14 |
15 | def convert_to_bitmap(slide_path, bitmap_dir, inference_dir, scale, map_key):
16 | slide_id = slide_path.split('/')[-1][:-4]
17 | slide_bitmap_subdir = os.path.join(bitmap_dir, slide_id)
18 | if not os.path.exists(slide_bitmap_subdir):
19 | os.mkdir(slide_bitmap_subdir)
20 | map_reverse_key = dict([(v, k) for k, v in map_key.items()])
21 |
22 | # load thumbnail
23 | slide = OpenSlide(slide_path)
24 |
25 | slide_mag = general_utils.get_magnification(slide)
26 | scale = general_utils.adjust_scale_for_slide_mag(slide_mag=slide_mag,
27 | desired_mag=config.args.magnification,
28 | scale=scale)
29 | thumbnail = general_utils.get_downscaled_thumbnail(slide, scale)
30 | overlap = int(config.args.overlap // scale)
31 |
32 | # create bitmaps
33 | bitmaps = {}
34 | for key, val in map_key.items():
35 | bitmaps[key] = np.zeros(thumbnail.shape[:2], dtype=np.uint8)
36 |
37 | # load tile class inference csv, create tile_address and predicted_class column
38 | df = pd.read_csv(os.path.join(inference_dir, slide_id + '.csv'))
39 | df['predicted_class'] = df.drop(columns=['label', 'tile_file_name']).idxmax(axis='columns').str.replace(
40 | 'score_', '').astype(int)
41 | df['address'] = df['tile_file_name'].apply(
42 | lambda x: [int(y) for y in x.replace('.png', '').split('/')[1].split('_')])
43 | # for each tile, populate the associated area in the bitmap with pred_class
44 |
45 | generator, generator_level = general_utils.get_full_resolution_generator(
46 | general_utils.array_to_slide(thumbnail),
47 | tile_size=config.desired_otsu_thumbnail_tile_size,
48 | overlap=overlap)
49 |
50 | for address, class_number in zip(df.address, df.predicted_class):
51 | extent = generator.get_tile_dimensions(generator_level, address)
52 | start = (address[1] * config.desired_otsu_thumbnail_tile_size,
53 | address[0] * config.desired_otsu_thumbnail_tile_size)
54 |
55 | class_label = map_reverse_key[class_number]
56 | _thumbnail = bitmaps[class_label]
57 | rr, cc = rectangle(start=start, extent=extent, shape=_thumbnail.shape)
58 | _thumbnail[rr, cc] = 255
59 |
60 | # save bitmaps
61 | for class_label, bitmap in bitmaps.items():
62 | _thumbnail = Image.fromarray(bitmap)
63 | _thumbnail.save(os.path.join(slide_bitmap_subdir, class_label + '.png'))
64 |
65 | # generate and save overlay
66 | vals = list(map_key.values())
67 | range_ = [np.min(vals), np.max(vals)]
68 | thumbnail = general_utils.visualize_tile_scoring(thumbnail,
69 | config.desired_otsu_thumbnail_tile_size,
70 | df.address.tolist(),
71 | df.predicted_class.tolist(),
72 | overlap=overlap,
73 | range_=range_)
74 | thumbnail = Image.fromarray(thumbnail)
75 | thumbnail = general_utils.label_image_tissue_type(thumbnail, map_key)
76 | thumbnail.save(os.path.join(slide_bitmap_subdir, '_overlay.png'))
77 |
78 |
79 | if __name__ == '__main__':
80 | checkpoint_name = config.args.checkpoint_path.split('/')[-1].replace('.torch', '')
81 | inference_dir = 'inference/{}'.format(checkpoint_name)
82 | bitmap_dir = 'bitmaps/{}'.format(checkpoint_name)
83 | if not os.path.exists(bitmap_dir):
84 | os.mkdir(bitmap_dir)
85 |
86 | df = pd.read_csv(config.args.preprocessed_cohort_csv_path)
87 | with open('../global_config.yaml', 'r') as f:
88 | DIRECTORIES = yaml.safe_load(f)
89 | DATA_DIR = DIRECTORIES['data_dir']
90 | df['image_path'] = df['image_path'].apply(lambda x: os.path.join(DATA_DIR, x))
91 |
92 | scale_factor = config.args.tile_size / config.desired_otsu_thumbnail_tile_size
93 |
94 | map_key = {'Stroma': 0,
95 | 'Tumor': 1,
96 | 'Fat': 2,
97 | 'Necrosis': 3}
98 |
99 | for slide_path in df['image_path']:
100 | print(slide_path)
101 | convert_to_bitmap(slide_path, bitmap_dir, inference_dir, scale_factor, map_key)
102 |
103 | # Parallel(n_jobs=32)(delayed(convert_to_bitmap)(slide, bitmap_dir, inference_dir, scale_factor, map_key) for slide in slide_list)
104 |
--------------------------------------------------------------------------------
/hne-feature-extraction/merge_cells_and_regions.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import os
3 | from PIL import Image
4 | import numpy as np
5 | from joblib import Parallel, delayed
6 | import sys
7 | sys.path.append('../tissue-type-training')
8 | import config
9 |
10 |
11 | def visualize_cell_detections(df_, size, fn):
12 | marker_size = 2
13 | max_cells = 1000
14 |     # single-channel canvas; sampled cell centroids are drawn as white squares
15 |     arr = np.zeros(size, dtype=np.uint8)
16 | if len(df_) > max_cells:
17 | df = df_.sample(max_cells)
18 | else:
19 | df = df_.copy()
20 | for index, row in df.iterrows():
21 | arr[(row.CentroidY-marker_size):(row.CentroidY+marker_size), (row.CentroidX-marker_size):(row.CentroidX+marker_size)] = 255
22 | img = Image.fromarray(arr)
23 | img.save(fn)
24 |
25 | # now visualize by class
26 | map_ = {'Tumor': np.array([0, 255, 0]),
27 | 'Stroma': np.array([0, 191, 255]),
28 | 'Necrosis': np.array([255, 0, 0]),
29 | 'Fat': np.array([255, 255, 0]),
30 | 'Unknown': np.array([255, 255, 255])}
31 | arr = np.zeros((size[0], size[1], 3)).astype(np.uint8)
32 | for index, row in df.iterrows():
33 | if row.Parent == 'Unknown':
34 | continue
35 | arr[(row.CentroidY-marker_size):(row.CentroidY+marker_size), (row.CentroidX-marker_size):(row.CentroidX+marker_size)] = map_[row.Parent]
36 | img = Image.fromarray(arr)
37 | img.save(fn.replace('.png', '_classified.png'))
38 |
39 |
40 | def load_tissue_maps(_dir, _classes):
41 | bbox_map = None
42 | d = {}
43 | for class_ in _classes:
44 | img_fn = os.path.join(_dir, class_ + '.png')
45 | if not os.path.exists(img_fn):
46 | return None, None
47 | img = Image.open(img_fn)
48 | img = np.array(img)
49 | img = (img != 0).astype(np.uint8)
50 | if bbox_map is None:
51 | bbox_map = img.astype(bool)
52 | else:
53 | bbox_map = bbox_map.astype(bool) | img.astype(bool)
54 | d[class_] = img
55 | img_size = img.shape
61 | return d, img_size
62 |
63 |
64 | def load_cell_detections(fn, scale_factor, shape):
65 | object_detections = pd.read_csv(fn, delimiter='\t')
66 | #print(object_detections)
67 | object_detections = object_detections.rename(columns={list(object_detections.columns)[5]: 'CentroidX', list(object_detections.columns)[6]: 'CentroidY'})
68 | object_detections['CentroidX'] =(object_detections['CentroidX'] // scale_factor).astype(int)
69 | object_detections['CentroidY'] = (object_detections['CentroidY'] // scale_factor).astype(int)
70 | #print(object_detections['CentroidX'].describe())
71 | #print(object_detections['CentroidY'].describe())
72 | object_detections.loc[object_detections.CentroidX >= shape[1], 'CentroidX'] = shape[1] - 1
73 | object_detections.loc[object_detections.CentroidY >= shape[0], 'CentroidY'] = shape[0] - 1
74 | object_detections['Parent'] = 'Unknown'
75 | return object_detections
76 |
77 |
78 | def assign_cell_parents(object_detections, tissue_regional_maps):
79 | for class_, tissue_map in tissue_regional_maps.items():
80 | #print(class_)
81 | object_detections = object_detections.apply(lambda x: _assign_single_class(x, tissue_map, class_), axis=1)
82 | return object_detections
83 |
84 |
85 | def _assign_single_class(row, tissue_map, tissue_type_name):
86 | tissue_map_val = tissue_map[row.CentroidY, row.CentroidX]
87 | if tissue_map_val != 0:
88 | row['Parent'] = tissue_type_name
89 | return row
90 |
91 | def process_slide(slide_id):
92 | tissue_maps, img_size = load_tissue_maps(os.path.join(region_detection_dir, slide_id.split('.')[0]), classes)
93 | output_fn = os.path.join(output_dir, slide_id + '.csv')
94 | if os.path.exists(output_fn):
95 | print("{} exists; skipping".format(output_fn))
96 | return
97 |
98 | if tissue_maps is None:
99 | print('{} has no tissue_maps; skipping'.format(slide_id))
100 | return
101 |
102 |     # a try/except wrapper was removed here; process each slide unconditionally
103 |     cell_detections = load_cell_detections(os.path.join(object_detection_dir, slide_id + '.tsv'), object_coords_to_region_coords_scale_factor, img_size)
104 |     cell_detections = assign_cell_parents(cell_detections, tissue_maps)
106 |     print('processed {}'.format(slide_id))
107 |     visualize_cell_detections(cell_detections, img_size, os.path.join(VIZ_DIR, slide_id + '.png'))
108 |     cell_detections = cell_detections[cell_detections.Parent != 'Unknown']
109 |     cell_detections.to_csv(output_fn, index=False)
110 |
111 |
112 | checkpoint_id = config.args.checkpoint_path.split('/')[-1].replace('.torch', '')
113 | object_detection_dir = 'qupath/data/results'
114 | region_detection_dir = 'bitmaps/{}'.format(checkpoint_id)
115 |
116 | slide_ids = [x[:-4] for x in os.listdir(object_detection_dir) if '.tsv' in x]
117 | classes = ['Tumor', 'Stroma', 'Fat', 'Necrosis']
118 |
119 | # bitmaps are generated at 4/128 = 1/32 resolution in pixel coordinates
120 | # cells are detected in µm coordinates at full resolution; 1 pixel = 0.5 µm
121 | object_coords_to_region_coords_scale_factor = 16.096
122 |
123 | output_dir = 'final_objects/{}'.format(checkpoint_id)
124 | if not os.path.exists(output_dir):
125 | os.mkdir(output_dir)
126 |
127 | VIZ_DIR = 'visualizations/{}'.format(checkpoint_id)
128 | if not os.path.exists(VIZ_DIR):
129 | os.mkdir(VIZ_DIR)
130 |
131 |
132 | if __name__ == '__main__':
133 | Parallel(n_jobs=64)(delayed(process_slide)(slide_id) for slide_id in slide_ids)
134 | #for slide_id in slide_ids:
135 | # process_slide(slide_id)
136 |
--------------------------------------------------------------------------------
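`merge_cells_and_regions.py` above assigns each detected cell a parent tissue type by converting its µm-space centroid into bitmap pixel coordinates and reading the class bitmaps at that location. A minimal sketch of the lookup (the region placement and centroid are hypothetical; 16.096 is the divisor used in the script):

```python
import numpy as np

scale_factor = 16.096  # µm -> bitmap-pixel divisor from the script above
tumor_map = np.zeros((100, 100), dtype=np.uint8)
tumor_map[10:20, 30:40] = 1  # a toy tumor region in the bitmap

centroid_x_um, centroid_y_um = 560.0, 240.0
col = int(centroid_x_um // scale_factor)  # 34
row = int(centroid_y_um // scale_factor)  # 14
parent = 'Tumor' if tumor_map[row, col] else 'Unknown'
print(row, col, parent)  # 14 34 Tumor
```
--------------------------------------------------------------------------------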
/hne-feature-extraction/qupath/Makefile:
--------------------------------------------------------------------------------
1 | export MYUID := $(shell id -u)
2 | export MYGID := $(shell id -g)
3 | export SINGULARITY_CACHEDIR := $(PWD)/.singularity/cache
4 | export SINGULARITY_TMPDIR := $(PWD)/.singularity/tmp
5 | export SINGULARITY_LOCALCACHEDIR := $(PWD)/.singularity/lcache
6 | IMAGE = docker://druvpatel/qupath-stardist:latest
7 |
8 | .PHONY: all help build clean test run
9 | .DEFAULT_GOAL := help
10 |
11 | # help menu
12 | define BROWSER_PYSCRIPT
13 | import os, webbrowser, sys
14 |
15 | try:
16 | from urllib import pathname2url
17 | except:
18 | from urllib.request import pathname2url
19 |
20 | webbrowser.open("file://" + pathname2url(os.path.abspath(sys.argv[1])))
21 | endef
22 | export BROWSER_PYSCRIPT
23 |
24 | define PRINT_HELP_PYSCRIPT
25 | import re, sys
26 |
27 | for line in sys.stdin:
28 | match = re.match(r'^([a-zA-Z_-]+):.*?## (.*)$$', line)
29 | if match:
30 | target, help = match.groups()
31 | print("%-20s %s" % (target, help))
32 | endef
33 | export PRINT_HELP_PYSCRIPT
34 |
35 |
36 | BROWSER := python -c "$$BROWSER_PYSCRIPT"
37 |
38 | help:
39 | @python -c "$$PRINT_HELP_PYSCRIPT" < $(MAKEFILE_LIST)
40 |
41 |
42 | build: ## build docker image
43 | docker build -t qupath/latest .
44 |
45 |
46 | init:
47 | ./init_singularity_env.sh
48 |
49 |
50 | clean: ## cleanup
51 | docker system prune
52 |
53 |
54 | clean-images: ## remove qupath-stardist docker and singularity images. WARNING: Will affect other users!
55 | docker rmi -f $(IMAGE)
56 | singularity cache clean -f
57 | rm qupath-stardist_latest.sif
58 |
59 |
60 | run-cpu: ## run a script within the star_dist service. This run uses CPUs.
61 | docker-compose -f docker-compose.yml run \
62 | star_dist \
63 | java -Djava.awt.headless=true -Djava.library.path=/qupath-cpu/build/dist/QuPath-0.2.3/lib/app \
64 | -jar /qupath-cpu/build/dist/QuPath-0.2.3/lib/app/qupath-0.2.3.jar \
65 | script --image /$(image) /$(script)
66 |
67 | run-gpu: ## run a script within the star_dist service. This run uses GPUs.
68 | docker-compose -f docker-compose.yml run \
69 | -e TF_FORCE_GPU_ALLOW_GROWTH=true -e PER_PROCESS_GPU_MEMORY_FRACTION=0.8 \
70 | star_dist \
71 | java -Djava.awt.headless=true -Djava.library.path=/qupath-gpu/build/dist/QuPath-0.2.3/lib/app \
72 | -jar /qupath-gpu/build/dist/QuPath-0.2.3/lib/app/qupath-0.2.3.jar \
73 | script --image /$(image) /$(script)
74 |
75 |
76 | build-singularity: init ## pulls qupath-stardist image and converts it to a singularity image
77 | singularity pull --force $(IMAGE)
78 | singularity cache list -v
79 |
80 |
81 | run-singularity-gpu: ## runs the qupath-stardist image in gpu mode
82 | singularity run --env TF_FORCE_GPU_ALLOW_GROWTH=true,PER_PROCESS_GPU_MEMORY_FRACTION=0.8 -B $(PWD)/data:/data,$(PWD)/detections:/detections,$(PWD)/models:/models,$(PWD)/scripts:/scripts --nv $(PWD)/qupath-stardist_latest.sif java -Djava.awt.headless=true \
83 | -Djava.library.path=/qupath-gpu/build/dist/QuPath-0.2.3/lib/app \
84 | -jar /qupath-gpu/build/dist/QuPath-0.2.3/lib/app/qupath-0.2.3.jar \
85 | script --image /$(image) /$(script)
86 |
87 | run-singularity-cpu: ## runs the qupath-stardist image in cpu mode
88 | singularity run -B $(PWD)/data:/data,$(PWD)/detections:/detections,$(PWD)/models:/models,$(PWD)/scripts:/scripts $(PWD)/qupath-stardist_latest.sif java -Djava.awt.headless=true \
89 | -Djava.library.path=/qupath-cpu/build/dist/QuPath-0.2.3/lib/app \
90 | -jar /qupath-cpu/build/dist/QuPath-0.2.3/lib/app/qupath-0.2.3.jar \
91 | script --image /$(image) /$(script)
92 |
93 |
94 |
95 |
--------------------------------------------------------------------------------
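The `help` target in the Makefile above pipes the Makefile itself through the embedded `PRINT_HELP_PYSCRIPT`, printing every `target: ## description` pair. A standalone sketch of that pattern (the `$$` in the Makefile becomes a single `$` once outside make; the toy makefile text is illustrative):

```python
import re

makefile_text = """\
build: ## build docker image
clean: ## cleanup
init:
"""

# targets without a trailing "## description" (like init) are skipped
for line in makefile_text.splitlines():
    match = re.match(r'^([a-zA-Z_-]+):.*?## (.*)$', line)
    if match:
        target, help_text = match.groups()
        print("%-20s %s" % (target, help_text))
# build                build docker image
# clean                cleanup
```
--------------------------------------------------------------------------------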
/hne-feature-extraction/qupath/README.md:
--------------------------------------------------------------------------------
1 | # QuPath bundled with StarDist and Tensorflow
2 |
3 | Sample data in data/sample_data has been downloaded from http://openslide.cs.cmu.edu/download/openslide-testdata/.
4 | More test data from various vendors can be found on this site.
5 |
6 | This repo contains a containerized version of QuPath+StarDist with GPU support. These containers can be built and run using Docker (Section 2) or Singularity (Section 1).
7 |
8 | # Overview
9 |
10 | ## What is QuPath?
11 | QuPath is a digital image analysis platform that is particularly useful for analyzing pathology images. QuPath runs Groovy-based scripts, which can be executed through the UI or, as in this case, headlessly through a Docker container.
12 |
13 | ## What is StarDist?
14 | StarDist is a nuclear segmentation algorithm that is quite capable of detecting and segmenting cells/nuclei in pathology images. It runs on a TensorFlow backend and provides prebuilt models for cellular segmentation in H&E and IF images.
15 |
16 | When running StarDist in QuPath, nuclear/cellular objects are created along with a dictionary of per-cell features such as staining properties (hematoxylin and eosin metrics for H&E) and geometric properties (size, shape, length, etc.).
17 |
18 | ## How to write and run your own scripts
19 | Some example scripts have been provided to demonstrate some of the functionalities of the QuPath groovy scripting interface.
20 |
21 | `stardist_example.groovy` --> This script runs the StarDist cellular segmentation algorithm with the parameters given in the file. It creates cellular objects along with a dictionary of per-cell features, and shows how these cell objects can be exported in two formats: geojson and TSV. Exporting as geojson records each cell's vertices outlining its segmentation, so the file can be quite large. TSV, on the other hand, does not retain the polygon outlines, only the centroid coordinates, which is much more compact.
22 |
23 | `detection_first_run_hne_stardist_segmentation.groovy` --> This is a more advanced script that combines multiple aspects of QuPath. It runs StarDist segmentation as well as a cellular classifier that assigns the resulting cellular objects to classes (in this case, lymphocyte vs. other cell phenotypes). In addition, this script performs whole-slide pixel classification using a basic model. The unique part of this script is that, upon export, each cellular object carries a class (lymphocyte vs. other) as well as a parent class (the regional annotation label the cell object falls within, based on the results of the pixel classifier).
24 |
25 | # Section 1: Building and Running Image with Singularity
26 | This image has been prebuilt on Docker Hub (as a Docker image) to run via Singularity.
27 |
28 | Git clone this repo.
29 |
30 | ```
31 | $ git clone https://github.com/msk-mind/docker.git
32 | ```
33 |
34 | Change to this directory and initialize the local singularity environment. `init_singularity_env.sh` creates a localized singularity environment in the current directory by creating a `.singularity` sub-directory. This script must be re-executed each time you start a new shell and want to run singularity commands against the localized environment directly from the shell, as opposed to using the targets in the Makefile.
35 |
36 | ```
37 | $ cd qupath
38 | $ ./init_singularity_env.sh
39 | ```
40 |
41 | Build the singularity image.
42 |
43 | ```
44 | $ make build-singularity
45 | ```
46 |
47 | Run the singularity image by specifying the script and image arguments. Like the docker image, the command for executing the container has been designed to use the 'data', 'scripts', 'detections', and 'models' directories to map these files to the container file system. These directories and files must be specified as relative paths. Any data that needs to be referenced outside of detections/, data/, scripts/, and models/ should be mounted using the -B flag. To do this, append the new mount (comma separated) to the -B argument in the Makefile under run-singularity-cpu and/or run-singularity-gpu, as /path/on/host:/bind/path/on/container (an example of extending the bind list follows this README).
48 |
49 | If successful, `stardist_example.groovy` will output the coordinates of cell objects (centroids) to `detections/CMU-1-Small-Region_2_stardist_detections_and_measurements.tsv`
50 |
51 | To run with CPUs, use `run-singularity-cpu`; for GPUs, use `run-singularity-gpu`.
52 |
53 | The first time the `run-singularity-gpu` make target is executed, it tends to take about 4 minutes. Subsequent runs tend to take 20-30 sec.
54 |
55 | Note: adding hosts is not currently supported with the singularity build.
56 |
57 | ```
58 | $ time make \
59 | script=scripts/sample_scripts/stardist_example.groovy \
60 | image=data/sample_data/CMU-1-Small-Region_2.svs run-singularity-gpu
61 | ```
62 |
63 | To restart with a clean slate, simply delete the `.singularity` sub-directory and re-initialize the local singularity environment by executing `init_singularity_env.sh`.
64 |
65 |
66 | # Section 2: Building and Running Image with Docker (WIP)
67 | ## Section 2: Part 1 -- Build image using Dockerfile
68 |
69 | For building with Docker, there is a small setup step. Using the following links, download these .deb files and copy them to this directory, `docker/qupath/`. You will have to create an NVIDIA developer account in order to do this.
70 |
71 | 1) libcudnn7_7.6.5.32-1%2Bcuda10.2_amd64.deb:
72 | https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7_7.6.5.32-1%2Bcuda10.2_amd64.deb
73 |
74 | 2) libcudnn7-dev_7.6.5.32-1%2Bcuda10.2_amd64.deb
75 | https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7-dev_7.6.5.32-1%2Bcuda10.2_amd64.deb
76 |
77 | 3) libcudnn7-doc_7.6.5.32-1%2Bcuda10.2_amd64.deb
78 | https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7-doc_7.6.5.32-1%2Bcuda10.2_amd64.deb
79 |
80 |
81 | Once you've downloaded these files, you can proceed with the build:
82 | ```
83 | $ make build
84 | $ docker images
85 |
86 | REPOSITORY TAG IMAGE ID CREATED SIZE
87 | qupath/latest latest ead3bb08477d About a minute ago 2.4GB
88 | adoptopenjdk/openjdk14 x86_64-debian-jdk-14.0.2_12 9350dbb3ad77 4 days ago 516MB
89 | ```
90 |
91 | ## Section 2: Part 2 -- Run QuPath groovy script using built Docker container
92 |
93 |
94 | The command for executing the container has been designed to use the 'data', 'detections', 'scripts' and 'models' directories to map these files to the container file system. These directories and files must be specified as relative paths.
95 |
96 | If the script uses an external API, the URL and IP of the API must be provided to the container using the host argument. The host IP can be obtained from the URL using the nslookup Linux command. Two or more hosts may be specified by passing multiple host arguments. If the script does not use any external API, the host argument must still be specified at least once with an empty string, since it is a required argument.
97 |
98 | To run with CPUs, use `run-cpu`; for GPUs, use `run-gpu`.
99 |
100 | Examples:
101 |
102 | This script can be used to import annotations from the getPathologyAnnotations API
103 | ```
104 | make host="--add-host=" \
105 | script=scripts/sample_scripts/import_annot_from_api.groovy \
106 | image=data/sample_data/HobI20-934829783117.svs run-cpu
107 | ```
108 |
109 | If successful, `stardist_example.groovy` will output a geojson of cell objects to data/test.geojson
110 | ```
111 | make host="" \
112 | script=scripts/sample_scripts/hne/stardist_example.groovy \
113 | image=data/sample_data/CMU-1-Small-Region_2.svs run-gpu
114 | ```
115 |
116 |
117 |
118 | ## Section 2: Part 3 -- Cleanup Docker container
119 | Cleans stopped/exited containers, unused networks, dangling images, dangling build caches
120 |
121 | ```
122 | $ make clean
123 | ```
124 |
125 | ## WIP/TODOs
126 | - Infrastructure: currently uses a single GPU but allocates all of them; a future job scheduler with a GPU allocator would help fully utilize available GPUs. (To come with Condor)
127 | - Get GPU working for the docker image (in the event we need to use docker instead of singularity)
128 |
129 |
130 | ## Logs
131 | - started with adoptopenjdk:openjdk14
132 |
133 | - ImageWriterIJTest > testTiffAndZip() FAILED
134 | java.awt.HeadlessException at ImageWriterIJTest.java:46
135 | 4 tests completed, 1 failed
136 |
137 | so, excluded tests with `gradle ... -x test` (see dockerfile)
138 |
139 | - Error: java.io.IOException: Cannot run program "objcopy": error=2, No such file or directory
140 |
141 | so, installed binutils (see dockerfile)
142 |
143 | - 17:18:31.139 [main] [ERROR] q.l.i.s.o.OpenslideServerBuilder - Could not load OpenSlide native libraries
144 | java.lang.UnsatisfiedLinkError: /qupath/build/dist/QuPath-0.2.3/lib/app/libopenslide-jni.so: libxml2.so.2: cannot open shared object file: No such file or directory
145 |
146 | so, installed native libs (see dockerfile)
147 |
148 | - cleanup stopped/exited containers, networks, dangling images, dangling build cache
149 | docker system prune
150 |
156 |
--------------------------------------------------------------------------------
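As noted in the README above, paths outside `data/`, `detections/`, `models/`, and `scripts/` must be added to the `-B` bind list in the Makefile. For example, to expose a hypothetical host directory `/mnt/extra_slides` inside the container at `/extra_slides`, the `-B` argument under `run-singularity-cpu` (or `run-singularity-gpu`) would become:

```
-B $(PWD)/data:/data,$(PWD)/detections:/detections,$(PWD)/models:/models,$(PWD)/scripts:/scripts,/mnt/extra_slides:/extra_slides
```
--------------------------------------------------------------------------------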
/hne-feature-extraction/qupath/data/results/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/qupath/data/results/.gitkeep
--------------------------------------------------------------------------------
/hne-feature-extraction/qupath/data/slides/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/qupath/data/slides/.gitkeep
--------------------------------------------------------------------------------
/hne-feature-extraction/qupath/docker-compose.yml:
--------------------------------------------------------------------------------
1 | # docker-compose.yml
2 | # Used to ensure the container runs with the host user id and gid (so files written by the container are owned by the user) and the Docker image remains user-agnostic.
3 | # https://medium.com/faun/set-current-host-user-for-docker-container-4e521cef9ffc
4 | version: '3'
5 | services:
6 | star_dist:
7 | image: docker.io/druvpatel/qupath-stardist
8 | user: $MYUID:$MYGID
9 | working_dir: $PWD
10 | stdin_open: true
11 | volumes:
12 | - $PWD/data:/data
13 | - $PWD/models:/models
14 | - $PWD/scripts:/scripts
15 | - $PWD/detections:/detections
16 | tty: true
17 |
--------------------------------------------------------------------------------
/hne-feature-extraction/qupath/dockerfile:
--------------------------------------------------------------------------------
1 |
2 | # https://hub.docker.com/r/adoptopenjdk/openjdk14
3 | FROM adoptopenjdk/openjdk14:x86_64-debian-jdk-14.0.2_12
4 | MAINTAINER MSK-MIND
5 | LABEL "app"="qupath"
6 | LABEL "version"="0.2.3"
7 | LABEL "description"="qupath bundled with stardist and tensorflow for CPUs. Change -Ptensorflow-cpu=false to switch to GPUs"
8 | RUN apt-get update && \
9 | apt-get install -y sudo tree wget git vim && \
10 | apt-get install -y gnupg2 && \
11 | apt-get install -y binutils && \
12 | apt-get install -y libxml2-dev libtiff-dev libglib2.0-0 libxcb-shm0-dev libxrender-dev libxcb-render0-dev && \
13 | apt-get install -y libgl1-mesa-glx && \
14 | apt-get install -y software-properties-common
15 |
16 | #install cuda 10.2
17 | RUN apt-get update && wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-ubuntu1604.pin && \
18 | sudo mv cuda-ubuntu1604.pin /etc/apt/preferences.d/cuda-repository-pin-600 && \
19 | sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub && \
20 | sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/ /" && \
21 | sudo apt-get update && \
22 | sudo apt-get -y install cuda-libraries-10-2
23 |
24 | RUN echo 'export PATH=/usr/local/cuda-10.2/bin:$PATH' >> ~/.bashrc && \
25 | echo 'export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc && \
26 | . ~/.bashrc
27 |
28 |
29 | # install cudnn7 for cuda 10.2
30 | ## these can be found on nvidia's website after logging in through a developer account
31 | ## to build this docker image manually, download these files and copy them to the /qupath directory (the same directory as this dockerfile)
32 | ## https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7_7.6.5.32-1%2Bcuda10.2_amd64.deb
33 | ## https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7-dev_7.6.5.32-1%2Bcuda10.2_amd64.deb
34 | ## https://developer.nvidia.com/compute/machine-learning/cudnn/secure/7.6.5.32/Production/10.2_20191118/Ubuntu16_04-x64/libcudnn7-doc_7.6.5.32-1%2Bcuda10.2_amd64.deb
35 | COPY libcudnn7_7.6.5.32-1+cuda10.2_amd64.deb .
36 | COPY libcudnn7-dev_7.6.5.32-1+cuda10.2_amd64.deb .
37 | COPY libcudnn7-doc_7.6.5.32-1+cuda10.2_amd64.deb .
38 | RUN apt-get install -y dpkg-dev && apt install ./libcudnn7_7.6.5.32-1+cuda10.2_amd64.deb && \
39 | apt install ./libcudnn7-dev_7.6.5.32-1+cuda10.2_amd64.deb && \
40 | apt install ./libcudnn7-doc_7.6.5.32-1+cuda10.2_amd64.deb
41 |
42 |
43 | RUN /bin/sh -c export DISPLAY=:0.0
44 | RUN /bin/sh -c cd / && mkdir qupath-gpu && \
45 | git clone --branch v0.2.3 https://github.com/qupath/qupath.git /qupath-gpu && \
46 | cd /qupath-gpu && \
47 | ./gradlew clean build createPackage -x test -Ptensorflow-gpu=true
48 |
49 | RUN /bin/sh -c cd / && mkdir qupath-cpu && \
50 | git clone --branch v0.2.3 https://github.com/qupath/qupath.git /qupath-cpu && \
51 | cd /qupath-cpu && \
52 | ./gradlew clean build createPackage -x test -Ptensorflow-cpu=true
53 |
54 | RUN apt-get update && apt-get install -y libnccl-dev
55 |
--------------------------------------------------------------------------------
/hne-feature-extraction/qupath/init_singularity_env.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | mkdir -p .singularity/cache
4 | mkdir -p .singularity/tmp
5 | mkdir -p .singularity/lcache
6 |
7 | export SINGULARITY_CACHEDIR=$PWD/.singularity/cache
8 | export SINGULARITY_TMPDIR=$PWD/.singularity/tmp
9 | export SINGULARITY_LOCALCACHEDIR=$PWD/.singularity/lcache
10 |
11 | echo using SINGULARITY_CACHEDIR = $SINGULARITY_CACHEDIR
12 | echo using SINGULARITY_TMPDIR = $SINGULARITY_TMPDIR
13 | echo using SINGULARITY_LOCALCACHEDIR= $SINGULARITY_LOCALCACHEDIR
14 |
--------------------------------------------------------------------------------
/hne-feature-extraction/qupath/models/ANN_StardistSeg3.0CellExp1.0CellConstraint_AllFeatures_LymphClassifier.json:
--------------------------------------------------------------------------------
1 | {
2 | "object_classifier_type": "OpenCVMLClassifier",
3 | "featureExtractor": {
4 | "feature_extractor_type": "NormalizedFeatureExtractor",
5 | "featureExtractor": {
6 | "feature_extractor_type": "DefaultFeatureExtractor",
7 | "measurements": [
8 | "Detection probability",
9 | "Nucleus: Area µm^2",
10 | "Nucleus: Length µm",
11 | "Nucleus: Circularity",
12 | "Nucleus: Solidity",
13 | "Nucleus: Max diameter µm",
14 | "Nucleus: Min diameter µm",
15 | "Cell: Area µm^2",
16 | "Cell: Length µm",
17 | "Cell: Circularity",
18 | "Cell: Solidity",
19 | "Cell: Max diameter µm",
20 | "Cell: Min diameter µm",
21 | "Nucleus/Cell area ratio",
22 | "Hematoxylin: Nucleus: Mean",
23 | "Hematoxylin: Nucleus: Median",
24 | "Hematoxylin: Nucleus: Min",
25 | "Hematoxylin: Nucleus: Max",
26 | "Hematoxylin: Nucleus: Std.Dev.",
27 | "Hematoxylin: Cytoplasm: Mean",
28 | "Hematoxylin: Cytoplasm: Median",
29 | "Hematoxylin: Cytoplasm: Min",
30 | "Hematoxylin: Cytoplasm: Max",
31 | "Hematoxylin: Cytoplasm: Std.Dev.",
32 | "Hematoxylin: Membrane: Mean",
33 | "Hematoxylin: Membrane: Median",
34 | "Hematoxylin: Membrane: Min",
35 | "Hematoxylin: Membrane: Max",
36 | "Hematoxylin: Membrane: Std.Dev.",
37 | "Hematoxylin: Cell: Mean",
38 | "Hematoxylin: Cell: Median",
39 | "Hematoxylin: Cell: Min",
40 | "Hematoxylin: Cell: Max",
41 | "Hematoxylin: Cell: Std.Dev.",
42 | "Eosin: Nucleus: Mean",
43 | "Eosin: Nucleus: Median",
44 | "Eosin: Nucleus: Min",
45 | "Eosin: Nucleus: Max",
46 | "Eosin: Nucleus: Std.Dev.",
47 | "Eosin: Cytoplasm: Mean",
48 | "Eosin: Cytoplasm: Median",
49 | "Eosin: Cytoplasm: Min",
50 | "Eosin: Cytoplasm: Max",
51 | "Eosin: Cytoplasm: Std.Dev.",
52 | "Eosin: Membrane: Mean",
53 | "Eosin: Membrane: Median",
54 | "Eosin: Membrane: Min",
55 | "Eosin: Membrane: Max",
56 | "Eosin: Membrane: Std.Dev.",
57 | "Eosin: Cell: Mean",
58 | "Eosin: Cell: Median",
59 | "Eosin: Cell: Min",
60 | "Eosin: Cell: Max",
61 | "Eosin: Cell: Std.Dev."
62 | ]
63 | },
64 | "normalizer": {
65 | "offsets": [
66 | 0.0,
67 | 0.0,
68 | 0.0,
69 | 0.0,
70 | 0.0,
71 | 0.0,
72 | 0.0,
73 | 0.0,
74 | 0.0,
75 | 0.0,
76 | 0.0,
77 | 0.0,
78 | 0.0,
79 | 0.0,
80 | 0.0,
81 | 0.0,
82 | 0.0,
83 | 0.0,
84 | 0.0,
85 | 0.0,
86 | 0.0,
87 | 0.0,
88 | 0.0,
89 | 0.0,
90 | 0.0,
91 | 0.0,
92 | 0.0,
93 | 0.0,
94 | 0.0,
95 | 0.0,
96 | 0.0,
97 | 0.0,
98 | 0.0,
99 | 0.0,
100 | 0.0,
101 | 0.0,
102 | 0.0,
103 | 0.0,
104 | 0.0,
105 | 0.0,
106 | 0.0,
107 | 0.0,
108 | 0.0,
109 | 0.0,
110 | 0.0,
111 | 0.0,
112 | 0.0,
113 | 0.0,
114 | 0.0,
115 | 0.0,
116 | 0.0,
117 | 0.0,
118 | 0.0,
119 | 0.0
120 | ],
121 | "scales": [
122 | 1.0,
123 | 1.0,
124 | 1.0,
125 | 1.0,
126 | 1.0,
127 | 1.0,
128 | 1.0,
129 | 1.0,
130 | 1.0,
131 | 1.0,
132 | 1.0,
133 | 1.0,
134 | 1.0,
135 | 1.0,
136 | 1.0,
137 | 1.0,
138 | 1.0,
139 | 1.0,
140 | 1.0,
141 | 1.0,
142 | 1.0,
143 | 1.0,
144 | 1.0,
145 | 1.0,
146 | 1.0,
147 | 1.0,
148 | 1.0,
149 | 1.0,
150 | 1.0,
151 | 1.0,
152 | 1.0,
153 | 1.0,
154 | 1.0,
155 | 1.0,
156 | 1.0,
157 | 1.0,
158 | 1.0,
159 | 1.0,
160 | 1.0,
161 | 1.0,
162 | 1.0,
163 | 1.0,
164 | 1.0,
165 | 1.0,
166 | 1.0,
167 | 1.0,
168 | 1.0,
169 | 1.0,
170 | 1.0,
171 | 1.0,
172 | 1.0,
173 | 1.0,
174 | 1.0,
175 | 1.0
176 | ],
177 | "missingValue": 0.0
178 | }
179 | },
180 | "classifier": {
181 | "class": "ANN_MLP",
182 | "statmodel": {
183 | "format": 3,
184 | "layer_sizes": [
185 | 54,
186 | 2
187 | ],
188 | "activation_function": "SIGMOID_SYM",
189 | "f_param1": 1.0,
190 | "f_param2": 1.0,
191 | "min_val": -9.4999999999999996e-01,
192 | "max_val": 9.4999999999999996e-01,
193 | "min_val1": -9.7999999999999998e-01,
194 | "max_val1": 9.7999999999999998e-01,
195 | "training_params": {
196 | "train_method": "RPROP",
197 | "dw0": 1.0000000000000001e-01,
198 | "dw_plus": 1.2000000000000000e+00,
199 | "dw_minus": 5.0000000000000000e-01,
200 | "dw_min": 1.1920928955078125e-07,
201 | "dw_max": 50.0,
202 | "term_criteria": {
203 | "epsilon": 1.0000000000000000e-02,
204 | "iterations": 1000
205 | }
206 | },
207 | "input_scale": [
208 | 1.3884230478630293e+01,
209 | -1.0436028848950381e+01,
210 | 4.0651217206917785e-02,
211 | -1.5288830675717675e+00,
212 | 1.4902576440765389e-01,
213 | -3.2632152740591982e+00,
214 | 1.8651759142111512e+01,
215 | -1.6932751758711024e+01,
216 | 7.9975348973840283e+01,
217 | -7.9813989744728346e+01,
218 | 3.7380041101211908e-01,
219 | -2.9905965766939682e+00,
220 | 6.1167716922545190e-01,
221 | -3.5192343231666916e+00,
222 | 2.3319211712556121e-02,
223 | -2.2874284833089127e+00,
224 | 1.3240275045470298e-01,
225 | -4.8788867016263282e+00,
226 | 1.5880131397602792e+01,
227 | -1.3770291753516185e+01,
228 | 5.1799179370354359e+01,
229 | -5.0826996462122764e+01,
230 | 3.5865866097388122e-01,
231 | -4.7469990799870470e+00,
232 | 4.6927276840148247e-01,
233 | -4.5527534416253266e+00,
234 | 1.0147320988075766e+01,
235 | -3.7268640560457684e+00,
236 | 3.7563079961938541e+00,
237 | -2.8222394651682250e+00,
238 | 3.4207509789013120e+00,
239 | -2.6126520844583649e+00,
240 | 7.2903065872421591e+00,
241 | -1.1929507138294884e+00,
242 | 2.0236865315993695e+00,
243 | -2.8701281475600342e+00,
244 | 8.9732545773844841e+00,
245 | -2.2544507476912941e+00,
246 | 9.2196282204976701e+00,
247 | -2.0842894032683126e+00,
248 | 8.4876590975436681e+00,
249 | -1.5404532839449356e+00,
250 | 2.1340521185877950e+01,
251 | 6.4541155503805547e-01,
252 | 3.0137932665358207e+00,
253 | -2.9686044300429262e+00,
254 | 1.5334500805912315e+01,
255 | -2.8641689072376773e+00,
256 | 8.6606817415675295e+00,
257 | -1.7829889272624488e+00,
258 | 8.5503568388131157e+00,
259 | -1.4597032278579616e+00,
260 | 1.9717646673081735e+01,
261 | 3.0540728997987365e-01,
262 | 3.1075686746922222e+00,
263 | -2.1133265088121331e+00,
264 | 1.2863514501373606e+01,
265 | -2.0644588022774628e+00,
266 | 7.9272915122578809e+00,
267 | -3.2034772169770411e+00,
268 | 6.1492929262431808e+00,
269 | -1.8573746533587634e+00,
270 | 2.1697724885291631e+01,
271 | 6.7287375171150243e-01,
272 | 2.0260795259491804e+00,
273 | -2.8956706968133008e+00,
274 | 7.4670422801461367e+00,
275 | -2.4551901654398010e+00,
276 | 1.7603024815704330e+01,
277 | -2.1020666846556337e+00,
278 | 1.7723557439925614e+01,
279 | -2.0074159355618573e+00,
280 | 4.8008125885821906e+00,
281 | 9.8948794236022031e-01,
282 | 3.0647364222699682e+00,
283 | -1.4691812331912364e+00,
284 | 1.2562629775839415e+01,
285 | -1.5394854943914542e+00,
286 | 1.8172876240023772e+01,
287 | -1.8730041731417946e+00,
288 | 1.7708167986592180e+01,
289 | -1.8311933989238971e+00,
290 | 8.8517673621658144e+00,
291 | 8.2400096007605295e-01,
292 | 7.7696295729603495e+00,
293 | -2.0826594164420631e+00,
294 | 4.8397922833946005e+01,
295 | -2.6474752710474303e+00,
296 | 1.7803035271910812e+01,
297 | -1.8877641496042963e+00,
298 | 1.7244050875851336e+01,
299 | -1.8057484529102026e+00,
300 | 1.3645091260729304e+01,
301 | 2.5527867627681750e-01,
302 | 9.2220882392120895e+00,
303 | -2.1773014024968322e+00,
304 | 4.4806689473518418e+01,
305 | -2.3295121210659278e+00,
306 | 1.8800432115096598e+01,
307 | -2.0465181349344106e+00,
308 | 1.8251381185567340e+01,
309 | -1.9319916323956488e+00,
310 | 4.7829401642175453e+00,
311 | 1.0379696088234613e+00,
312 | 3.0625094741185959e+00,
313 | -1.4977509313276964e+00,
314 | 2.2666122841938190e+01,
315 | -1.9651871233359419e+00
316 | ],
317 | "output_scale": [
318 | 1.0,
319 | 0.0,
320 | 1.0,
321 | 0.0
322 | ],
323 | "inv_output_scale": [
324 | 1.0,
325 | 0.0,
326 | 1.0,
327 | 0.0
328 | ],
329 | "weights": [
330 | [
331 | 5.3131502607590453e-01,
332 | 1.3319032953543575e-01,
333 | 7.5677286629269147e-02,
334 | 3.1806455147243787e-03,
335 | -7.5296388482771714e-01,
336 | 1.7410383250467160e-01,
337 | -6.5718326911217329e-01,
338 | 5.6077796882772435e-01,
339 | 1.4355113845598161e-01,
340 | -2.7589820288151073e-01,
341 | -9.3726938824941830e-01,
342 | 1.1273909427692799e+00,
343 | 3.3792224510181301e-01,
344 | 1.2580719918406363e-01,
345 | -7.4447272743560278e-01,
346 | 5.0350170458677901e-02,
347 | -8.1057370658860572e-01,
348 | 6.4570399303329262e-01,
349 | -2.8815758124418311e-01,
350 | 1.0613828078319948e-01,
351 | 3.7224267919660070e-01,
352 | -9.2478641209280732e-02,
353 | -3.2678283297318611e-01,
354 | 1.3640023219146697e+00,
355 | -1.0364842630454227e+00,
356 | 1.6912933555400111e+00,
357 | -5.3676523232830187e-01,
358 | 7.4875001179266654e-01,
359 | 1.1938863816812013e+00,
360 | -2.1728246328279646e+00,
361 | 1.9005019638922551e+00,
362 | -3.6124597246245034e+00,
363 | -1.6750112022431682e-01,
364 | -1.9736721466993229e-01,
365 | -5.3750827455159578e-03,
366 | 1.0342167736052412e+00,
367 | -1.3496084665863162e-01,
368 | -1.3093091792621530e+00,
369 | -1.5330208939125556e-01,
370 | -7.7381676828638113e-03,
371 | 5.1686721932381874e-01,
372 | 6.3168367752957066e-01,
373 | -7.9753752002407829e-01,
374 | 6.6204874496166144e-01,
375 | -2.6004031347000062e-01,
376 | -4.2408458497567647e-01,
377 | -5.4686580456146239e-01,
378 | 4.2267154081098107e-01,
379 | 6.2117431785880362e-01,
380 | 7.0263537699839507e-01,
381 | -1.1724467699660965e+00,
382 | -3.7546113575704887e-01,
383 | -9.4388902234390426e-01,
384 | -6.1731725412889404e-01,
385 | -7.3112030637524750e-01,
386 | 2.4387987377295198e-01,
387 | 9.6708275345891925e-01,
388 | 1.5507609635086583e-01,
389 | -5.2843267148099748e-01,
390 | -1.2043873253617352e-01,
391 | -6.0703943680523076e-01,
392 | 4.1308862491398889e-02,
393 | 2.2855201109314574e-01,
394 | 1.1902682401126872e+00,
395 | -1.4586465862716863e-01,
396 | -5.8694397098993642e-02,
397 | 2.0510988973135409e+00,
398 | -4.4058807822921127e+00,
399 | 5.3717924422466412e-01,
400 | 1.0892272856150961e-01,
401 | -4.5976354260767632e-01,
402 | -4.5710278828830914e-02,
403 | -6.7111726958554385e-02,
404 | -1.1266965359361265e-01,
405 | 1.3495583249953398e+00,
406 | -1.0543581371422708e+00,
407 | 6.7783428476582031e-01,
408 | -7.5820186675059664e-01,
409 | 6.8271281498259717e-01,
410 | -5.8791671161933907e-02,
411 | -9.3443263461570902e-01,
412 | -9.4366261563277237e-02,
413 | -6.5883480364552061e-01,
414 | -6.4001380261849916e-01,
415 | 1.0451022662285130e-01,
416 | 4.1599372517711841e-01,
417 | -5.1432979559771852e-01,
418 | 1.5904643675018923e-01,
419 | -8.4432473820839860e-01,
420 | 4.1337428950857474e-01,
421 | 9.8918057798433190e-01,
422 | 9.2360472791201131e-01,
423 | 7.1413128861014574e-01,
424 | 2.5961010759479131e-01,
425 | -7.9186179150659919e-01,
426 | 3.1202167349121934e-01,
427 | -3.3639543745691192e-01,
428 | 4.5762746282061900e-01,
429 | 9.5306022142570823e-01,
430 | -2.5812084340957475e-01,
431 | -3.7019001215589353e-01,
432 | -4.3609373559489228e-01,
433 | 5.9580211866483324e-01,
434 | -7.7101094798609293e-01,
435 | 3.8444305971371939e-01,
436 | -1.8652039138823739e+00,
437 | 6.6781601499775434e-01,
438 | -2.6964085469021148e-01,
439 | -1.7095248939777175e+00,
440 | 1.9560948934635687e+00
441 | ]
442 | ]
443 | }
444 | },
445 | "pathClasses": [
446 | {
447 | "name": "Lymphocyte",
448 | "colorRGB": -14336
449 | },
450 | {
451 | "name": "Other",
452 | "colorRGB": -16744320
453 | }
454 | ],
455 | "requestProbabilityEstimate": false,
456 | "filter": "DETECTIONS_ALL",
457 | "timestamp": 1612801125986
458 | }
--------------------------------------------------------------------------------
/hne-feature-extraction/qupath/models/he_heavy_augment/saved_model.pb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/qupath/models/he_heavy_augment/saved_model.pb
--------------------------------------------------------------------------------
/hne-feature-extraction/qupath/models/he_heavy_augment/variables/variables.data-00000-of-00001:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/qupath/models/he_heavy_augment/variables/variables.data-00000-of-00001
--------------------------------------------------------------------------------
/hne-feature-extraction/qupath/models/he_heavy_augment/variables/variables.index:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/qupath/models/he_heavy_augment/variables/variables.index
--------------------------------------------------------------------------------
/hne-feature-extraction/qupath/scripts/stardist_nuclei_and_lymphocytes.groovy:
--------------------------------------------------------------------------------
1 | import qupath.tensorflow.stardist.StarDist2D
2 | import qupath.lib.io.GsonTools
3 | import static qupath.lib.gui.scripting.QPEx.*
4 | setImageType('BRIGHTFIELD_H_E');
5 | setColorDeconvolutionStains('{"Name" : "H&E default", "Stain 1" : "Hematoxylin", "Values 1" : "0.60968 0.65246 0.4501 ", "Stain 2" : "Eosin", "Values 2" : "0.21306 0.87722 0.43022 ", "Background" : " 243 243 243 "}');
6 |
7 | def imageData = getCurrentImageData()
8 | def server = imageData.getServer()
9 |
10 | // get dimensions of slide
11 | minX = 0
12 | minY = 0
13 | maxX = server.getWidth()
14 | maxY = server.getHeight()
15 |
16 | print 'maxX: ' + maxX
17 | print 'maxY: ' + maxY
18 |
19 | // create rectangle roi (over entire area of image) for detections to be run over
20 | def plane = ImagePlane.getPlane(0, 0)
21 | def roi = ROIs.createRectangleROI(minX, minY, maxX-minX, maxY-minY, plane)
22 | def annotationROI = PathObjects.createAnnotationObject(roi)
23 | addObject(annotationROI)
24 | selectAnnotations();
25 | def pathModel = '/models/he_heavy_augment'
26 | def cell_expansion_factor = 3.0
27 | def cellConstrainScale = 1.0
28 | def stardist = StarDist2D.builder(pathModel)
29 | .threshold(0.5) // Probability (detection) threshold
30 | .normalizePercentiles(1, 99) // Percentile normalization
31 | .pixelSize(0.5) // Resolution for detection
32 | .cellExpansion(cell_expansion_factor) // Approximate cells based upon nucleus expansion
33 | .cellConstrainScale(cellConstrainScale) // Constrain cell expansion using nucleus size
34 | .measureShape() // Add shape measurements
35 | .measureIntensity() // Add cell measurements (in all compartments)
36 | .includeProbability(true) // Add probability as a measurement (enables later filtering)
37 | .nThreads(10)
38 | .build()
39 | // select rectangle object created
40 | selectObjects {
41 | //Some criteria here
42 | return it == annotationROI
43 | }
44 | def pathObjects = getSelectedObjects()
45 | print 'Selected ' + pathObjects.size()
46 | // stardist segmentations
47 | stardist.detectObjects(imageData, pathObjects)
48 | def celldetections = getDetectionObjects()
49 | print 'Detected ' + celldetections.size()
50 | selectDetections();
51 | // obj classifier
52 | runObjectClassifier("/models/ANN_StardistSeg3.0CellExp1.0CellConstraint_AllFeatures_LymphClassifier.json")
53 | def filename = GeneralTools.getNameWithoutExtension(server.getMetadata().getName())
54 | saveDetectionMeasurements('/data/results/' + filename + '.tsv')
55 |
--------------------------------------------------------------------------------
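The Groovy script above writes one TSV of per-cell measurements per slide via `saveDetectionMeasurements`. Below is a minimal downstream-parsing sketch in Python, assuming the class column is named `Class` (newer QuPath versions use `Classification`; the sample export under `detections/` shows the exact layout for this pipeline):

import pandas as pd

# Load the per-cell detection measurements exported by the Groovy script.
# The path points at the sample TSV shipped with this repo.
detections = pd.read_csv(
    'detections/CMU-1-Small-Region_2_stardist_detections_and_measurements.tsv',
    sep='\t')

# Count detections per predicted class (Lymphocyte vs. Other).
print(detections['Class'].value_counts())

# Fraction of lymphocytes among all detected cells.
print('Lymphocyte fraction: {:.3f}'.format(
    (detections['Class'] == 'Lymphocyte').mean()))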
/hne-feature-extraction/visualizations/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/hne-feature-extraction/visualizations/.gitkeep
--------------------------------------------------------------------------------
/license.md:
--------------------------------------------------------------------------------
1 | **Onco-Fusion Terms of Use**
2 |
3 | **PLEASE READ THIS DOCUMENT CAREFULLY BEFORE YOU ACCESS OR USE Onco-Fusion. BY ACCESSING ANY PORTION OF Onco-Fusion, YOU AGREE TO BE BOUND BY THE TERMS AND CONDITIONS SET FORTH BELOW. IF YOU DO NOT WISH TO BE BOUND BY THESE TERMS AND CONDITIONS, PLEASE DO NOT ACCESS Onco-Fusion.**
4 |
5 | **Onco-Fusion** is developed and maintained by Memorial Sloan Kettering Cancer Center ("MSK," "we", or "us") to **extract features from histopathologic and radiologic imaging and study their contributions to multimodal prognostic models.** MSK may, from time to time, update the software and other content on
6 | **https://github.com/kmboehm/onco-fusion** ("Content"). MSK makes no warranties or representations, and hereby disclaims any warranties, express or implied, with respect to any of the Content, including as to the present accuracy, completeness, timeliness, adequacy, or usefulness of any of the Content. The entire risk as to the quality and performance of the Content is with you. By using this Content, you agree that MSK will not be liable for any losses or damages arising from your use of or reliance on the Content, or other websites or information to which this Content may be linked, including any general, special, incidental or consequential damages arising out of the use or inability to use the Content including, but not limited to, loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of the Content to operate with any other software, programs, source code, etc.
7 |
8 | By making any use of the Content or submitting any information or data along with such to us, you authorize MSK to copy, modify, display, distribute, perform, use, publish, and otherwise exploit the same for any and all purposes, all without compensation to you, for as long as we decide (collectively, the "Use Rights"). In addition, you authorize MSK to grant any third party some or all of the Use Rights. By way of example, and not limitation, the Use Rights include the right for us to publish any data or information submitted in whole or in part for as long as we choose. By providing any data or information, you represent and warrant that (i) you own all rights in and to the information or data (including any related intellectual property rights) or have sufficient authority and right to provide the content and to grant the Use Rights; (ii) your submission of the information or data and grant to us of Use Rights do not violate or conflict with the rights of other persons, or breach your obligations to other persons; and (iii) the information or data does not include or contain any personally identifiable information (PII) or protected health information (PHI).
9 |
10 | DO NOT submit personally identifiable information (PII) or protected health information (PHI) in connection with any information or data or otherwise.
11 |
12 | You may use **Onco-Fusion** , the underlying content, and any output therefrom for personal, academic research, and noncommercial purposes only, including teaching and research at universities, colleges and other educational institutions. You may not use it for any other purpose. You may not publish the Content in any capacity, including in scientific or academic journals or literature, or the results of such research, without MSK's express written permission. You may not otherwise redistribute or share the Content with any third party, in part or in whole, for any purpose, without the express written permission of MSK. Any use of the Content for commercial purposes, including but not restricted to consulting activities, design of commercial hardware or software products, and a commercial entity participating in research projects, requires MSK's express written permission and provision of an appropriate license.
13 |
14 | Without limiting the generality of the foregoing, you may not use any part of **Onco-Fusion** , the underlying Content or the output for any other purpose, including:
15 |
16 | 1. use or incorporation into a commercial product or towards the performance of a commercial service;
17 | 2. research use in a commercial setting;
18 | 3. diagnosis, treatment or use for patient care or the provision of medical services; or
19 | 4. generation of reports in a medical, laboratory, hospital or other patient care setting.
20 |
21 | You may not copy, transfer, reproduce, modify, sell, sublicense, distribute or create derivative works of **Onco-Fusion** or the underlying Content for any commercial purpose without the express permission of MSK. Any attempt otherwise to copy, transfer, modify, sublicense or distribute the Content is void, and will automatically terminate your rights under these Terms of Use.
22 |
23 | The output of **Onco-Fusion** and the underlying Content is not a substitute for professional medical help, judgment, or advice. Use of **Onco-Fusion** does not create a physician-patient relationship or in any way make a person a patient of MSK. A physician or other qualified health provider should always be consulted for any health problem or medical condition.
24 |
25 | Neither these Terms of Use nor the availability of **Onco-Fusion** should be understood to create an obligation or expectation that MSK will continue to make **Onco-Fusion** available. MSK may discontinue or restrict the availability of **Onco-Fusion** at any time. MSK may also modify these Terms of Use at any time.
26 |
27 | Any use of the Content is subject to MSK's intellectual property rights, including any granted, pending or filed provisional patent applications, and any other patent, copyright, trademark, trade secret or other intellectual property rights. MSK respects the intellectual property rights of others, just as it expects others to respect its intellectual property. If you believe that any material added (including data or information uploaded by you) to the Content, or other activity taking place on the website where it is hosted, constitutes infringement of a work protected by copyright, please notify us as follows:
28 |
29 | Email **boehmk@mskcc.org**.
30 |
31 | Your notice must comply with the Digital Millennium Copyright Act (17 U.S.C. §512) (the "DMCA"). Upon receipt of a compliant notice, we will respond and proceed in accordance with the DMCA.
32 |
33 | By using **Onco-Fusion** , you consent to the jurisdiction and venue of the state and federal courts located in New York City, New York, USA, for any claims related to or arising from your use of **Onco-Fusion** or your violation of these Terms of Use, and agree that you will not bring any claims against MSK that relate to or arise from the foregoing except in those courts.
34 |
35 | If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of these Terms of Use, they do not excuse you from the conditions of these Terms of Use.
36 |
37 | If any provision of these Terms of Use is held to be invalid or unenforceable, then such provision shall be struck and the remaining provisions shall be enforced. Headings are for reference purposes only and in no way define, limit, construe, or describe the scope or extent of such section. MSK's failure to act with respect to a breach by you or others does not waive its right to act with respect to subsequent or similar breaches. This agreement and the terms and conditions contained herein set forth the entire understanding and agreement between MSK and you with respect to the subject matter hereof and supersede any prior or contemporaneous understanding, whether written or oral.
38 |
39 | Inquiries about the Content should be directed to **boehmk@mskcc.org**.
40 |
41 | If you are interested in using **Onco-Fusion** for purposes beyond those permitted by these Terms of Use, please contact **boehmk@mskcc.org** to inquire concerning the availability of a license.
42 |
43 | #151032716\_v1
44 |
--------------------------------------------------------------------------------
/survival-modeling/environment.yml:
--------------------------------------------------------------------------------
1 | name: sklearn
2 | channels:
3 | - conda-forge
4 | - bioconda
5 | - defaults
6 | dependencies:
7 | - affine=2.3.0=py_0
8 | - albumentations=0.5.2=pyhd8ed1ab_0
9 | - aom=3.2.0=he49afe7_2
10 | - astor=0.8.1=pyh9f0ad1d_0
11 | - attrs=21.2.0=pyhd8ed1ab_0
12 | - autograd=1.3=py_0
13 | - autograd-gamma=0.5.0=pyh9f0ad1d_0
14 | - bleach=4.1.0=pyhd8ed1ab_0
15 | - blosc=1.21.0=he49afe7_0
16 | - bokeh=2.4.2=py39h6e9494a_0
17 | - boost-cpp=1.74.0=hff03dee_4
18 | - brotli=1.0.9=h0d85af4_6
19 | - brotli-bin=1.0.9=h0d85af4_6
20 | - brotlipy=0.7.0=py39h89e85a6_1003
21 | - brunsli=0.1=h046ec9c_0
22 | - bzip2=1.0.8=h0d85af4_4
23 | - c-ares=1.18.1=h0d85af4_0
24 | - c-blosc2=2.0.4=ha1a4663_1
25 | - ca-certificates=2021.10.8=h033912b_0
26 | - cairo=1.16.0=he43a7df_1008
27 | - certifi=2021.10.8=py39h6e9494a_1
28 | - cffi=1.15.0=py39he338e87_0
29 | - cfitsio=3.470=h01dc385_7
30 | - chardet=4.0.0=py39h6e9494a_2
31 | - charls=2.2.0=h046ec9c_0
32 | - click=7.1.2=pyh9f0ad1d_0
33 | - click-plugins=1.1.1=py_0
34 | - cligj=0.7.2=pyhd8ed1ab_1
35 | - cloudpickle=2.0.0=pyhd8ed1ab_0
36 | - colorama=0.4.4=pyh9f0ad1d_0
37 | - colorcet=3.0.0=pyhd8ed1ab_0
38 | - cryptography=36.0.0=py39h209aa08_0
39 | - curl=7.80.0=hf45b732_1
40 | - cycler=0.11.0=pyhd8ed1ab_0
41 | - cytoolz=0.11.2=py39h89e85a6_1
42 | - dask-core=2021.12.0=pyhd8ed1ab_0
43 | - et_xmlfile=1.0.1=py_1001
44 | - expat=2.4.1=he49afe7_0
45 | - ffmpeg=4.4.1=h79e7b16_0
46 | - fontconfig=2.13.1=h10f422b_1005
47 | - fonttools=4.28.3=py39h89e85a6_0
48 | - formulaic=0.2.4=pyhd8ed1ab_0
49 | - freetype=2.10.4=h4cff582_1
50 | - freexl=1.0.6=h0d85af4_0
51 | - fsspec=2021.11.1=pyhd8ed1ab_0
52 | - future=0.18.2=py39h6e9494a_4
53 | - geos=3.9.1=he49afe7_2
54 | - geotiff=1.6.0=h26421ea_6
55 | - gettext=0.19.8.1=hd1a6beb_1008
56 | - giflib=5.2.1=hbcb3906_2
57 | - gmp=6.2.1=h2e338ed_0
58 | - gnutls=3.6.13=h756fd2b_1
59 | - graphite2=1.3.13=h2e338ed_1001
60 | - harfbuzz=2.9.1=h159f659_1
61 | - hdf4=4.2.15=hefd3b78_3
62 | - hdf5=1.10.6=nompi_hc5d9132_1114
63 | - holoviews=1.14.7=pyhd8ed1ab_0
64 | - icu=68.2=he49afe7_0
65 | - idna=2.10=pyh9f0ad1d_0
66 | - imagecodecs=2021.8.26=py39he5b32f2_1
67 | - imageio=2.13.3=pyh239f2a4_0
68 | - imgaug=0.4.0=py_1
69 | - importlib-metadata=4.10.1=py39h6e9494a_0
70 | - interface_meta=1.2.4=pyhd8ed1ab_0
71 | - jasper=1.900.1=h636a363_1006
72 | - jbig=2.1=h0d85af4_2003
73 | - jdcal=1.4.1=py_0
74 | - jinja2=3.0.3=pyhd8ed1ab_0
75 | - joblib=1.1.0=pyhd8ed1ab_0
76 | - jpeg=9d=hbcb3906_0
77 | - json-c=0.15=hcb556a6_0
78 | - jxrlib=1.1=h35c211d_2
79 | - kealib=1.4.14=h31dd65d_2
80 | - kiwisolver=1.3.2=py39hf018cea_1
81 | - krb5=1.19.2=hcfbf3a7_3
82 | - lame=3.100=h35c211d_1001
83 | - lcms2=2.12=h577c468_0
84 | - lerc=3.0=he49afe7_0
85 | - libaec=1.0.6=he49afe7_0
86 | - libblas=3.9.0=12_osx64_openblas
87 | - libbrotlicommon=1.0.9=h0d85af4_6
88 | - libbrotlidec=1.0.9=h0d85af4_6
89 | - libbrotlienc=1.0.9=h0d85af4_6
90 | - libcblas=3.9.0=12_osx64_openblas
91 | - libcurl=7.80.0=hf45b732_1
92 | - libcxx=12.0.1=habf9029_0
93 | - libdap4=3.20.6=h3e144a0_2
94 | - libdeflate=1.8=h0d85af4_0
95 | - libedit=3.1.20191231=h0678c8f_2
96 | - libev=4.33=haf1e3a3_1
97 | - libffi=3.4.2=h0d85af4_5
98 | - libgdal=3.1.4=h85d2021_18
99 | - libgfortran=5.0.0=9_3_0_h6c81a4c_23
100 | - libgfortran5=9.3.0=h6c81a4c_23
101 | - libglib=2.70.2=hf1fb8c0_0
102 | - libiconv=1.16=haf1e3a3_0
103 | - libkml=1.3.0=h8fd9edb_1014
104 | - liblapack=3.9.0=12_osx64_openblas
105 | - liblapacke=3.9.0=12_osx64_openblas
106 | - libnetcdf=4.8.1=nompi_hb4d10b0_100
107 | - libnghttp2=1.43.0=h6f36284_1
108 | - libopenblas=0.3.18=openmp_h3351f45_0
109 | - libopencv=4.5.3=py39h852ad08_1
110 | - libpng=1.6.37=h7cec526_2
111 | - libpq=13.5=hea3049e_1
112 | - libprotobuf=3.16.0=hcf210ce_0
113 | - librttopo=1.1.0=h5413771_6
114 | - libspatialite=5.0.1=h035f608_6
115 | - libssh2=1.10.0=h52ee1ee_2
116 | - libtiff=4.3.0=hd146c10_2
117 | - libvpx=1.11.0=he49afe7_3
118 | - libwebp-base=1.2.1=h0d85af4_0
119 | - libxml2=2.9.12=h93ec3fd_0
120 | - libzip=1.8.0=h8b0c345_1
121 | - libzlib=1.2.11=h9173be1_1013
122 | - libzopfli=1.0.3=h046ec9c_0
123 | - lifelines=0.26.3=pyhd8ed1ab_0
124 | - llvm-openmp=12.0.1=hda6cdc1_1
125 | - locket=0.2.0=py_2
126 | - lz4-c=1.9.3=he49afe7_1
127 | - markdown=3.3.6=pyhd8ed1ab_0
128 | - markupsafe=2.0.1=py39h89e85a6_1
129 | - matplotlib-base=3.5.0=py39hb07454d_0
130 | - munkres=1.1.4=pyh9f0ad1d_0
131 | - ncurses=6.2=h2e338ed_4
132 | - nettle=3.6=hedd7734_0
133 | - networkx=2.6.3=pyhd8ed1ab_1
134 | - numpy=1.21.4=py39h7eed0ac_0
135 | - olefile=0.46=pyh9f0ad1d_1
136 | - opencv=4.5.3=py39h6e9494a_1
137 | - openh264=2.1.1=hfd3ada9_0
138 | - openjpeg=2.4.0=h6e7aa92_1
139 | - openpyxl=3.0.6=pyhd8ed1ab_0
140 | - openssl=1.1.1l=h0d85af4_0
141 | - packaging=21.3=pyhd8ed1ab_0
142 | - pandas=1.3.4=py39h4d6be9b_1
143 | - panel=0.12.6=pyhd8ed1ab_0
144 | - param=1.12.0=pyh6c4a22f_0
145 | - partd=1.2.0=pyhd8ed1ab_0
146 | - patsy=0.5.2=pyhd8ed1ab_0
147 | - pcre=8.45=he49afe7_0
148 | - pillow=8.4.0=py39he9bb72f_0
149 | - pip=21.3.1=pyhd8ed1ab_0
150 | - pixman=0.40.0=hbcb3906_0
151 | - poppler=21.03.0=h640f9a4_0
152 | - poppler-data=0.4.11=hd8ed1ab_0
153 | - postgresql=13.5=he8fe76e_1
154 | - proj=8.0.1=h1512c50_0
155 | - py-opencv=4.5.3=py39h71a6800_1
156 | - pycparser=2.21=pyhd8ed1ab_0
157 | - pyct=0.4.6=py_0
158 | - pyct-core=0.4.6=py_0
159 | - pyopenssl=21.0.0=pyhd8ed1ab_0
160 | - pyparsing=3.0.6=pyhd8ed1ab_0
161 | - pysocks=1.7.1=py39h6e9494a_4
162 | - python=3.9.7=h1248fe1_3_cpython
163 | - python-dateutil=2.8.2=pyhd8ed1ab_0
164 | - python_abi=3.9=2_cp39
165 | - pytz=2021.3=pyhd8ed1ab_0
166 | - pyviz_comms=2.1.0=pyhd8ed1ab_0
167 | - pywavelets=1.2.0=py39hc89836e_1
168 | - pyyaml=6.0=py39h89e85a6_3
169 | - rasterio=1.2.0=py39h2b252dd_0
170 | - readline=8.1=h05e3726_0
171 | - requests=2.25.1=pyhd3deb0d_0
172 | - scikit-image=0.19.0=py39h4d6be9b_0
173 | - scikit-learn=1.0.1=py39hd4eea88_2
174 | - scipy=1.7.3=py39h056f1c0_0
175 | - seaborn=0.11.2=hd8ed1ab_0
176 | - seaborn-base=0.11.2=pyhd8ed1ab_0
177 | - setuptools=59.4.0=py39h6e9494a_0
178 | - shapely=1.8.0=py39h1d9c377_0
179 | - six=1.16.0=pyh6c4a22f_0
180 | - snappy=1.1.8=hb1e8313_3
181 | - snuggs=1.4.7=py_0
182 | - sqlite=3.37.0=h23a322b_0
183 | - statsmodels=0.13.1=py39hc89836e_0
184 | - svt-av1=0.8.7=he49afe7_1
185 | - threadpoolctl=3.0.0=pyh8a188c0_0
186 | - tifffile=2021.11.2=pyhd8ed1ab_0
187 | - tiledb=2.3.4=h8370e7a_0
188 | - tk=8.6.11=h5dbffcc_1
189 | - toolz=0.11.2=pyhd8ed1ab_0
190 | - tornado=6.1=py39h89e85a6_2
191 | - tqdm=4.62.3=pyhd8ed1ab_0
192 | - typing_extensions=4.0.1=pyha770c72_0
193 | - tzcode=2021e=h0d85af4_0
194 | - tzdata=2021e=he74cb21_0
195 | - urllib3=1.26.7=pyhd8ed1ab_0
196 | - webencodings=0.5.1=py_1
197 | - wheel=0.37.0=pyhd8ed1ab_1
198 | - wrapt=1.13.3=py39h89e85a6_1
199 | - x264=1!161.3030=h0d85af4_1
200 | - x265=3.5=h940c156_1
201 | - xerces-c=3.2.3=h379762d_3
202 | - xlrd=2.0.1=pyhd8ed1ab_3
203 | - xz=5.2.5=haf1e3a3_1
204 | - yaml=0.2.5=haf1e3a3_0
205 | - zfp=0.5.5=h4a89273_8
206 | - zipp=3.7.0=pyhd8ed1ab_1
207 | - zlib=1.2.11=h9173be1_1013
208 | - zstd=1.5.0=h582d3a0_0
209 | - pip:
210 | - factor-analyzer==0.3.1
211 | - littleutils==0.2.2
212 | - matplotlib-venn==0.11.6
213 | - outdated==0.2.0
214 | - pandas-flavor==0.2.0
215 | - pingouin==0.3.11
216 | - statannot==0.2.3
217 | - tabulate==0.8.7
218 | - varclushi==0.1.0
219 | - xarray==0.16.2
220 | prefix: /Users/boehmk/anaconda3/envs/sklearn
221 |
--------------------------------------------------------------------------------
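Each module pins its own conda environment like the one above. Assuming a standard conda installation, it can be recreated with `conda env create -f environment.yml` from the module directory; the trailing `prefix:` line records the exporting machine's local path and can safely be removed or overridden with the `-p` flag.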
/survival-modeling/figures/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/.gitkeep
--------------------------------------------------------------------------------
/survival-modeling/figures/barplots/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/barplots/.gitkeep
--------------------------------------------------------------------------------
/survival-modeling/figures/crs_plots/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/crs_plots/.gitkeep
--------------------------------------------------------------------------------
/survival-modeling/figures/feature_plots/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/feature_plots/.gitkeep
--------------------------------------------------------------------------------
/survival-modeling/figures/forest_plots/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/forest_plots/.gitkeep
--------------------------------------------------------------------------------
/survival-modeling/figures/km_plots/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/km_plots/.gitkeep
--------------------------------------------------------------------------------
/survival-modeling/figures/multimodal/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/figures/multimodal/.gitkeep
--------------------------------------------------------------------------------
/survival-modeling/results/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/results/.gitkeep
--------------------------------------------------------------------------------
/survival-modeling/results/crs/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/results/crs/.gitkeep
--------------------------------------------------------------------------------
/survival-modeling/results/model_summaries/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/survival-modeling/results/model_summaries/.gitkeep
--------------------------------------------------------------------------------
/survival-modeling/utils.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import yaml
4 | import os
5 |
6 | with open('../global_config.yaml', 'r') as f:
7 | CONFIGS = yaml.safe_load(f)
8 | DATA_DIR = CONFIGS['data_dir']
9 | CODE_DIR = CONFIGS['code_dir']
10 |
11 |
12 | def load_crs(binarize=False, drop_net=False):
13 | df = pd.read_csv(os.path.join(DATA_DIR, 'data', 'dataframes', 'crs_df.csv'))
14 | df['Patient ID'] = df['Patient ID'].astype(str).apply(lambda x: x.zfill(3))
15 | df = df.set_index('Patient ID')
16 | if drop_net:
17 | df = df[df.CRS != 'NET']
18 | if binarize:
19 | df.loc[df.CRS=='2', 'CRS'] = '1/2'
20 | df.loc[df.CRS=='1', 'CRS'] = '1/2'
21 | df.loc[df.CRS=='3', 'CRS'] = '3/NET'
22 | df.loc[df.CRS=='NET', 'CRS'] = '3/NET'
23 | df.loc[df.CRS.str.contains('1'), 'CRS'] = '1/2'
24 | df['CRS'] = df['CRS'].astype(str)
25 | return df
26 |
27 |
28 | def load_os():
29 | df = pd.read_csv(os.path.join(DATA_DIR, 'data', 'dataframes', 'clin_df.csv'))
30 | df['Patient ID'] = df['Patient ID'].astype(str)
31 | df = df.set_index('Patient ID')
32 | df = df[['duration.OS', 'observed.OS']]
33 | df = df.rename(columns={'duration.OS': 'duration',
34 | 'observed.OS': 'observed'})
35 | return df
36 |
37 | def load_pfs():
38 | df = pd.read_csv(os.path.join(DATA_DIR, 'data', 'dataframes', 'clin_df.csv'))
39 | df['Patient ID'] = df['Patient ID'].astype(str)
40 | df = df.set_index('Patient ID')
41 | df = df[['duration.PFS', 'observed.PFS']]
42 | df = df.rename(columns={'duration.PFS': 'duration',
43 | 'observed.PFS': 'observed'})
44 | return df
45 |
46 | def load_clin(cols=['Complete gross resection', 'stage', 'age', 'Type of surgery', 'adnexal_lesion', 'omental_lesion', 'Received PARPi']):
47 | df = pd.read_csv(os.path.join(DATA_DIR, 'data', 'dataframes', 'clin_df.csv'))
48 | df['Patient ID'] = df['Patient ID'].astype(str)
49 | df = df.set_index('Patient ID')
50 | df = df[cols]
51 | return df
52 |
53 | def load_pathomic_features():
54 | df = pd.read_csv(os.path.join(CODE_DIR, 'code', 'hne-feature-extraction', 'tissue_tile_features', 'reference_hne_features.csv'))
55 | df['Patient ID'] = df['Patient ID'].astype(str)
56 | df = df.set_index('Patient ID')
57 | return df
58 |
59 |
60 | def load_radiomic_features(site='omentum'):
61 | if site == 'omentum':
62 | df = pd.read_csv(os.path.join(CODE_DIR, 'code', 'ct-feature-extraction', 'features', 'ct_features_omentum.csv'))
63 | elif site == 'ovary':
64 | df = pd.read_csv(os.path.join(CODE_DIR, 'code', 'ct-feature-extraction', 'features', 'ct_features_ovary.csv'))
65 | else:
66 | raise NotImplementedError("Unknown radiomic site: {}".format(site))
67 | df['Patient ID'] = df['Patient ID'].astype(str)
68 | df = df.set_index('Patient ID')
69 | return df
70 |
71 |
72 |
73 | def load_all_ids(imaging_only=False):
74 | radiomic_ids = set(load_radiomic_features('omentum').index).union(set(load_radiomic_features('ovary').index))
75 | pathomic_ids = set(load_pathomic_features().index)
76 | if imaging_only:
77 | ids = pathomic_ids.union(radiomic_ids)
78 | else:
79 | clinical_ids = set(load_clin().index)
80 | genomic_ids = set(load_genom().index)
81 | ids = pathomic_ids.union(radiomic_ids).union(genomic_ids).union(clinical_ids)
82 | return ids
83 |
84 |
85 | def load_genom():
86 | genom = pd.read_csv(os.path.join(DATA_DIR, 'data', 'dataframes', 'genomic_df.csv'))
87 | genom['Patient ID'] = genom['Patient ID'].astype(str)
88 | genom = genom.set_index('Patient ID')
89 | genom.loc[genom['HRD status'] == 'HRP', 'hrd_status'] = False
90 | genom.loc[genom['HRD status'] == 'HRD', 'hrd_status'] = True
91 | genom = genom.dropna(subset=['HRD status'])
92 | genom.hrd_status = genom.hrd_status.astype(bool)
93 | genom = genom[['hrd_status']]
94 |
95 | return genom
96 |
--------------------------------------------------------------------------------
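A minimal sketch of assembling these loaders into one per-patient table, in the spirit of the late-fusion setup (the inner join and column choices are illustrative assumptions; `train_test.py` defines the actual modeling pipeline):

import utils

# Illustrative: one row per patient across modalities, inner-joined on 'Patient ID'.
clin = utils.load_clin()                             # clinical covariates
genom = utils.load_genom()                           # binary HRD status
path_feats = utils.load_pathomic_features()          # H&E tissue-tile features
rad_feats = utils.load_radiomic_features('omentum')  # CT features (omental site)
outcome = utils.load_os()                            # overall-survival duration/event

# Only patients profiled in every modality survive the inner join.
df = clin.join([genom, path_feats, rad_feats, outcome], how='inner')
print('{} fully profiled patients'.format(len(df)))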
/tissue-type-training/checkpoints/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/checkpoints/.gitkeep
--------------------------------------------------------------------------------
/tissue-type-training/checkpoints/tissue_type_classifier_weights.torch:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/checkpoints/tissue_type_classifier_weights.torch
--------------------------------------------------------------------------------
/tissue-type-training/config.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | from torch import device
3 | import yaml
4 | import os
5 |
6 |
7 | with open('../global_config.yaml', 'r') as f:
8 | CONFIGS = yaml.safe_load(f)
9 | DATA_DIR = CONFIGS['data_dir']
10 |
11 | parser = argparse.ArgumentParser()
12 |
13 | parser.add_argument('--preprocessed_cohort_csv_path',
14 | type=str,
15 | default='preprocessed_msk_os_h&e_cohort.csv',
16 | help='Full path to CSV file describing whole slide images and outcomes.')
17 |
18 | parser.add_argument('--checkpoint_path',
19 | type=str,
20 | default='',
21 | help='Location of checkpoint to load for preds OR to resume for training.')
22 |
23 | parser.add_argument('--experiment_name',
24 | type=str,
25 | default='EXPERIMENT_NAME',
26 | help='name under which to store checkpoints')
27 |
28 | parser.add_argument('--crossval',
29 | type=int,
30 | default=0,
31 | help='0: no xval, >0: k-fold xval')
32 |
33 | parser.add_argument('--normalize',
34 | action='store_true',
35 | default=False,
36 | help='whether to apply macenko normalization')
37 |
38 | parser.add_argument('--cohort_csv_path',
39 | type=str,
40 | default='msk_os_h&e_cohort.csv',
41 | help='Full path to CSV file describing whole slide images and outcomes.')
42 |
43 | parser.add_argument('--tile_size',
44 | type=int,
45 | default=512,
46 | help='Edge length of each tile used for training/evaluation.')
47 |
48 | parser.add_argument('--tile_selection_type',
49 | type=str,
50 | default='manual',
51 | help='manual or otsu')
52 |
53 | parser.add_argument('--otsu_threshold',
54 | type=float,
55 | default=0.25,
56 | help='Percentage foreground required to include tile.')
57 |
58 | parser.add_argument('--purple_threshold',
59 | type=float,
60 | default=0.25,
61 | help='Percentage purple required to include tile.')
62 |
63 | parser.add_argument('--magnification',
64 | type=int,
65 | default=20,
66 | help='Magnification of WSI.')
67 |
68 | parser.add_argument('--model',
69 | type=str,
70 | default='resnet18',
71 | help='cnn architecture')
72 |
73 | parser.add_argument('--overlap',
74 | type=int,
75 | default=0,
76 | help='n pixels of tile overlap (for preprocess and pretile)')
77 |
78 | parser.add_argument('--batch_size',
79 | type=int,
80 | default=96,
81 | help='Batch size for inference or training.')
82 |
83 | parser.add_argument('--gpu',
84 | type=int,
85 | default=[0],
86 | nargs='+',
87 | help='Which GPU(s) to use for training.')
88 |
89 | parser.add_argument('--learning_rate',
90 | type=float,
91 | default=0.001,
92 | help='Learning rate for training.')
93 |
94 | parser.add_argument('--weight_decay',
95 | type=float,
96 | default=1e-4,
97 | help='Weight decay for Adam optimizer.')
98 |
99 | parser.add_argument('--num_epochs',
100 | type=int,
101 | default=20,
102 | help='Number of epochs to train.')
103 |
104 | parser.add_argument('--min_n_tiles',
105 | type=int,
106 | default=100,
107 | help='Minimum number of tiles per slide.')
108 |
109 | parser.add_argument('--tile_dir',
110 | type=str,
111 | default='pretilings_512',
112 | help='Directory to store/load tiles.')
113 |
114 | parser.add_argument('--val_pred_file',
115 | type=str,
116 | default='',
117 | help='Location of val pred .csv to load for eval.')
118 |
119 |
120 | args = parser.parse_args()
121 |
122 | args.preprocessed_cohort_csv_path = os.path.join(DATA_DIR, args.preprocessed_cohort_csv_path)
123 | args.cohort_csv_path = os.path.join(DATA_DIR, args.cohort_csv_path)
124 |
125 | desired_otsu_thumbnail_tile_size = 4
126 |
--------------------------------------------------------------------------------
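`config.py` parses the command line at import time, so every script in this directory shares a single flag namespace: importing `config` anywhere yields the already-parsed `config.args`. A minimal sketch of the pattern (the script name is hypothetical):

# show_config.py -- hypothetical illustration of the shared-config pattern
import config  # argparse runs on import; flags come from sys.argv

print('model: {} | magnification: {}x | tile size: {}px'.format(
    config.args.model, config.args.magnification, config.args.tile_size))
print('cohort CSV: {}'.format(config.args.cohort_csv_path))

Run as, e.g., `python show_config.py --model resnet18 --tile_size 64`.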
/tissue-type-training/confusion_matrix_analysis.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from matplotlib import pyplot as plt
4 | from sklearn.metrics import confusion_matrix
5 | import seaborn as sns
6 | from scipy.stats import binom_test
7 |
8 |
9 | FONTSIZE = 8
10 | plt.rc('legend',fontsize=FONTSIZE, title_fontsize=FONTSIZE)
11 | plt.rc('xtick',labelsize=FONTSIZE)
12 | plt.rc('ytick',labelsize=FONTSIZE)
13 | plt.rc("axes", labelsize=FONTSIZE)
14 |
15 |
16 |
17 | def cm_analysis(y_true, y_pred, labels=None, ymap=None, figsize=(2.66,2.66), filename='confusion_matrix.pdf', acc_pval=False):
18 | """
19 | Generate matrix plot of confusion matrix with pretty annotations.
20 | The plot image is saved to disk.
21 | args:
22 | y_true: true label of the data, with shape (nsamples,)
23 | y_pred: prediction of the data, with shape (nsamples,)
24 | filename: filename of figure file to save
25 | labels: string array, name the order of class labels in the confusion matrix.
26 | use `clf.classes_` if using scikit-learn models.
27 | with shape (nclass,).
28 | ymap: dict: any -> string, length == nclass.
29 | if not None, map the labels & ys to more understandable strings.
30 | Caution: original y_true, y_pred and labels must align.
31 | figsize: the size of the figure plotted.
32 | """
33 | if ymap is not None:
34 | y_pred = [ymap[int(yi)] for yi in y_pred]
35 | y_true = [ymap[int(yi)] for yi in y_true]
36 | cm = confusion_matrix(y_true, y_pred, labels=labels)
37 | cm_sum = np.sum(cm, axis=1, keepdims=True)
38 | cm_perc = cm / cm_sum.astype(float) * 100
39 | annot = np.empty_like(cm).astype(str)
40 | nrows, ncols = cm.shape
41 | for i in range(nrows):
42 | for j in range(ncols):
43 | c = cm[i, j]
44 | p = cm_perc[i, j]
45 | if i == j:
46 | s = cm_sum[i]
47 | # annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s)
48 | annot[i, j] = '{:.0f}%'.format(p)
49 | elif c == 0:
50 | annot[i, j] = ''
51 | else:
52 | # annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s)
53 | annot[i, j] = '{:.0f}%'.format(p)
54 |
55 | cm = pd.DataFrame(cm_perc, index=labels, columns=labels)
56 | cm.index.name = 'Actual'
57 | cm.columns.name = 'Predicted'
58 | fig, ax = plt.subplots(figsize=figsize, constrained_layout=True)
59 | # sns.set(font_scale=1.6)
60 | sns.heatmap(cm, annot=annot, fmt='', ax=ax, square=True, cbar=False, annot_kws={'fontsize': FONTSIZE})
61 | if acc_pval:
62 | y_pred = np.array(y_pred)
63 | y_true = np.array(y_true)
64 | _, counts = np.unique(y_true, return_counts=True)
65 | no_information_rate = np.max(counts) / float(len(y_true))
66 | plt.text(0, 0, 'p = {:4.3f}'.format(binom_test(x=np.sum(y_pred == y_true),
67 | n=len(y_pred),
68 | p=no_information_rate)))
69 | plt.savefig(filename, dpi=300 if '.svg' not in filename else None)
70 |
--------------------------------------------------------------------------------
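A minimal usage sketch for `cm_analysis` with toy labels (the values are illustrative; `eval_tissue_tile.py` shows the real call with integer predictions plus a `ymap`):

from confusion_matrix_analysis import cm_analysis

# Toy example: integer class ids, mapped to readable names via ymap.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]

cm_analysis(y_true, y_pred,
            labels=['Stroma', 'Tumor', 'Fat', 'Necrosis'],
            ymap={0: 'Stroma', 1: 'Tumor', 2: 'Fat', 3: 'Necrosis'},
            filename='toy_confusion.png',
            acc_pval=True)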
/tissue-type-training/cross_validate_on_annotations.sh:
--------------------------------------------------------------------------------
1 | ARGS="--magnification 20
2 | --cohort_csv_path data/dataframes/tissuetype_hne_df.csv
3 | --preprocessed_cohort_csv_path data/dataframes/preprocessed_tissuetype_hne_df.csv
4 | --tile_dir pretilings
5 | --tile_selection_type manual
6 | --otsu_threshold 0.2
7 | --batch_size 96
8 | --min_n_tiles 1
9 | --num_epochs 30
10 | --crossval 4
11 | --gpu 0
12 | --overlap 32
13 | --tile_size 64
14 | --normalize
15 | --model resnet18
16 | --experiment_name xval4
17 | --learning_rate 0.0005"
18 |
19 | python preprocess.py ${ARGS}
20 |
21 | python pretile.py ${ARGS}
22 |
23 | python train_tissue_tile_clf.py ${ARGS}
24 |
25 | python pred_tissue_tile.py ${ARGS} --checkpoint_path "checkpoints/xval4_fold0_epoch021.torch" --val_pred_file "predictions/xval4_fold0_epoch021.torch_val.csv"
26 | python pred_tissue_tile.py ${ARGS} --checkpoint_path "checkpoints/xval4_fold1_epoch021.torch" --val_pred_file "predictions/xval4_fold1_epoch021.torch_val.csv"
27 | python pred_tissue_tile.py ${ARGS} --checkpoint_path "checkpoints/xval4_fold2_epoch021.torch" --val_pred_file "predictions/xval4_fold2_epoch021.torch_val.csv"
28 | python pred_tissue_tile.py ${ARGS} --checkpoint_path "checkpoints/xval4_fold3_epoch021.torch" --val_pred_file "predictions/xval4_fold3_epoch021.torch_val.csv"
29 |
30 | python eval_tissue_tile.py ${ARGS} --val_pred_file "predictions/xval4_fold{}_epoch021.torch_val.csv"
31 |
--------------------------------------------------------------------------------
/tissue-type-training/dataset.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import general_utils
3 | import torch
4 |
5 | from torch.utils.data import Dataset
6 | from PIL import Image, ImageFile
7 |
8 |
9 | ImageFile.LOAD_TRUNCATED_IMAGES = True
10 |
11 |
12 | class TissueTileDataset(Dataset):
13 | def __init__(self, df, tile_dir, transforms=None):
14 | self.df = df
15 | self.tile_dir = tile_dir
16 | self.transforms = transforms
17 |
18 | def __len__(self):
19 | return len(self.df)
20 |
21 | def __getitem__(self, item):
22 | row = self.df.iloc[item]
23 | tile = Image.open(row['tile_file_name'])
24 |
25 | if self.transforms:
26 | tile = self.transforms(tile)
27 |
28 | label = float(row['tile_class'])
29 |
30 | return '/'.join(row['tile_file_name'].split('/')[-2:]), tile.float(), label
31 |
--------------------------------------------------------------------------------
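A minimal sketch of feeding `TissueTileDataset` to a PyTorch `DataLoader` (the dataframe layout mirrors what the pretiling step writes, but the paths here are placeholders that must exist on disk):

import pandas as pd
from torch.utils.data import DataLoader
from torchvision import transforms
from dataset import TissueTileDataset

# Placeholder tile table: one row per tile with its path and class id.
df = pd.DataFrame({
    'tile_file_name': ['slide_001/0_0.png', 'slide_001/0_1.png'],
    'tile_class': [1, 0],
})

# ToTensor is required: __getitem__ calls tile.float() on the transformed tile.
dataset = TissueTileDataset(df, tile_dir='pretilings_512',
                            transforms=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=96, shuffle=True, num_workers=4)

for tile_ids, tiles, labels in loader:
    print(tile_ids, tiles.shape, labels)
    break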
/tissue-type-training/environment.yml:
--------------------------------------------------------------------------------
1 | name: transformer
2 | channels:
3 | - pytorch
4 | - conda-forge
5 | - defaults
6 | dependencies:
7 | - _libgcc_mutex=0.1=main
8 | - _openmp_mutex=4.5=1_gnu
9 | - astor=0.8.1=py39h06a4308_0
10 | - autograd=1.3=pyhd3eb1b0_1
11 | - autograd-gamma=0.5.0=pyh9f0ad1d_0
12 | - blas=1.0=mkl
13 | - blosc=1.21.0=h8c45485_0
14 | - bottleneck=1.3.2=py39hdd57654_1
15 | - brotli=1.0.9=he6710b0_2
16 | - brunsli=0.1=h2531618_0
17 | - bzip2=1.0.8=h7b6447c_0
18 | - c-ares=1.18.1=h7f8727e_0
19 | - ca-certificates=2022.2.1=h06a4308_0
20 | - cairo=1.16.0=hf32fb01_1
21 | - certifi=2021.10.8=py39h06a4308_2
22 | - cfitsio=3.470=hf0d0db6_6
23 | - charls=2.2.0=h2531618_0
24 | - cloudpickle=2.0.0=pyhd3eb1b0_0
25 | - cudatoolkit=10.2.89=hfd86e86_1
26 | - cycler=0.11.0=pyhd3eb1b0_0
27 | - cytoolz=0.11.0=py39h27cfd23_0
28 | - dask-core=2021.10.0=pyhd3eb1b0_0
29 | - ffmpeg=4.2.2=h20bf706_0
30 | - fontconfig=2.13.1=h6c09931_0
31 | - fonttools=4.25.0=pyhd3eb1b0_0
32 | - formulaic=0.2.4=pyhd8ed1ab_0
33 | - freetype=2.11.0=h70c0345_0
34 | - fsspec=2022.1.0=pyhd3eb1b0_0
35 | - future=0.18.2=py39h06a4308_1
36 | - gdk-pixbuf=2.42.6=h8cc273a_5
37 | - geos=3.8.0=he6710b0_0
38 | - giflib=5.2.1=h7b6447c_0
39 | - glib=2.69.1=h4ff587b_1
40 | - gmp=6.2.1=h2531618_2
41 | - gnutls=3.6.15=he1e5248_0
42 | - gobject-introspection=1.68.0=py39he41a700_3
43 | - icu=58.2=he6710b0_3
44 | - imagecodecs=2021.8.26=py39h4cda21f_0
45 | - imageio=2.9.0=pyhd3eb1b0_0
46 | - intel-openmp=2021.4.0=h06a4308_3561
47 | - interface_meta=1.2.4=pyhd8ed1ab_0
48 | - joblib=1.1.0=pyhd3eb1b0_0
49 | - jpeg=9d=h7f8727e_0
50 | - jxrlib=1.1=h7b6447c_2
51 | - kiwisolver=1.3.2=py39h295c915_0
52 | - krb5=1.19.2=hac12032_0
53 | - lame=3.100=h7b6447c_0
54 | - lcms2=2.12=h3be6417_0
55 | - ld_impl_linux-64=2.35.1=h7274673_9
56 | - lerc=3.0=h295c915_0
57 | - libaec=1.0.4=he6710b0_1
58 | - libcurl=7.80.0=h0b77cf5_0
59 | - libdeflate=1.8=h7f8727e_5
60 | - libedit=3.1.20210910=h7f8727e_0
61 | - libev=4.33=h7f8727e_1
62 | - libffi=3.3=he6710b0_2
63 | - libgcc-ng=9.3.0=h5101ec6_17
64 | - libgfortran-ng=7.5.0=ha8ba4b0_17
65 | - libgfortran4=7.5.0=ha8ba4b0_17
66 | - libgomp=9.3.0=h5101ec6_17
67 | - libidn2=2.3.2=h7f8727e_0
68 | - libnghttp2=1.46.0=hce63b2e_0
69 | - libopus=1.3.1=h7b6447c_0
70 | - libpng=1.6.37=hbc83047_0
71 | - libssh2=1.9.0=h1ba5d50_1
72 | - libstdcxx-ng=9.3.0=hd4cf53a_17
73 | - libtasn1=4.16.0=h27cfd23_0
74 | - libtiff=4.2.0=h85742a9_0
75 | - libunistring=0.9.10=h27cfd23_0
76 | - libuuid=1.0.3=h7f8727e_2
77 | - libuv=1.40.0=h7b6447c_0
78 | - libvpx=1.7.0=h439df22_0
79 | - libwebp=1.2.2=h55f646e_0
80 | - libwebp-base=1.2.2=h7f8727e_0
81 | - libxcb=1.14=h7b6447c_0
82 | - libxml2=2.9.12=h03d6c58_0
83 | - libzopfli=1.0.3=he6710b0_0
84 | - lifelines=0.26.5=pyhd8ed1ab_0
85 | - locket=0.2.1=py39h06a4308_1
86 | - lz4-c=1.9.3=h295c915_1
87 | - matplotlib-base=3.5.1=py39ha18d171_0
88 | - mkl=2021.4.0=h06a4308_640
89 | - mkl-service=2.4.0=py39h7f8727e_0
90 | - mkl_fft=1.3.1=py39hd3c417c_0
91 | - mkl_random=1.2.2=py39h51133e4_0
92 | - munkres=1.1.4=py_0
93 | - ncurses=6.3=h7f8727e_2
94 | - nettle=3.7.3=hbbd107a_1
95 | - networkx=2.6.3=pyhd3eb1b0_0
96 | - numexpr=2.8.1=py39h6abb31d_0
97 | - numpy=1.21.2=py39h20f2e39_0
98 | - numpy-base=1.21.2=py39h79a1101_0
99 | - openh264=2.1.1=h4ff587b_0
100 | - openjpeg=2.4.0=h3ad879b_0
101 | - openslide=3.4.1=h8137273_1
102 | - openslide-python=1.1.2=py39h3811e60_0
103 | - openssl=1.1.1m=h7f8727e_0
104 | - packaging=21.3=pyhd3eb1b0_0
105 | - pandas=1.3.5=py39h8c16a72_0
106 | - partd=1.2.0=pyhd3eb1b0_0
107 | - patsy=0.5.2=py39h06a4308_1
108 | - pcre=8.45=h295c915_0
109 | - pillow=9.0.1=py39h22f2fdc_0
110 | - pip=21.2.4=py39h06a4308_0
111 | - pixman=0.40.0=h7f8727e_1
112 | - pyparsing=3.0.4=pyhd3eb1b0_0
113 | - python=3.9.7=h12debd9_1
114 | - python-dateutil=2.8.2=pyhd3eb1b0_0
115 | - python_abi=3.9=2_cp39
116 | - pytorch=1.10.1=py3.9_cuda10.2_cudnn7.6.5_0
117 | - pytorch-mutex=1.0=cuda
118 | - pytz=2021.3=pyhd3eb1b0_0
119 | - pywavelets=1.1.1=py39h6323ea4_4
120 | - pyyaml=6.0=py39h7f8727e_1
121 | - readline=8.1.2=h7f8727e_1
122 | - scikit-image=0.18.3=py39h51133e4_0
123 | - scikit-learn=1.0.2=py39h51133e4_0
124 | - scipy=1.7.3=py39hc147768_0
125 | - seaborn=0.11.2=hd8ed1ab_0
126 | - seaborn-base=0.11.2=pyhd8ed1ab_0
127 | - setuptools=58.0.4=py39h06a4308_0
128 | - shapely=1.7.1=py39h1728cc4_0
129 | - six=1.16.0=pyhd3eb1b0_1
130 | - snappy=1.1.8=he6710b0_0
131 | - sqlite=3.37.2=hc218d9a_0
132 | - statsmodels=0.13.0=py39h7f8727e_0
133 | - threadpoolctl=2.2.0=pyh0d69192_0
134 | - tifffile=2021.7.2=pyhd3eb1b0_2
135 | - tk=8.6.11=h1ccaba5_0
136 | - toolz=0.11.2=pyhd3eb1b0_0
137 | - torchaudio=0.10.1=py39_cu102
138 | - torchvision=0.11.2=py39_cu102
139 | - typing_extensions=3.10.0.2=pyh06a4308_0
140 | - tzdata=2021e=hda174b7_0
141 | - wheel=0.37.1=pyhd3eb1b0_0
142 | - wrapt=1.13.3=py39h7f8727e_2
143 | - x264=1!157.20191217=h7b6447c_0
144 | - xz=5.2.5=h7b6447c_0
145 | - yaml=0.2.5=h7b6447c_0
146 | - zfp=0.5.5=h295c915_6
147 | - zlib=1.2.11=h7f8727e_4
148 | - zstd=1.4.9=haebb681_0
149 | - pip:
150 | - argparse==1.4.0
151 | - einops==0.3.2
152 | - nystrom-attention==0.0.11
153 |
--------------------------------------------------------------------------------
/tissue-type-training/eval_tissue_tile.py:
--------------------------------------------------------------------------------
1 | from sklearn.metrics import average_precision_score, accuracy_score, balanced_accuracy_score, confusion_matrix
2 | from sklearn.utils.class_weight import compute_class_weight
3 | import pandas as pd
4 | import config
5 | import json
6 | from openslide import OpenSlide
7 | from openslide.lowlevel import OpenSlideUnsupportedFormatError
8 | from PIL import Image
9 | import os
10 | import general_utils
11 | import numpy as np
12 | import matplotlib.pyplot as plt
13 | from joblib import Parallel, delayed
14 | from confusion_matrix_analysis import cm_analysis
15 | from general_utils import label_image_tissue_type, add_scale_bar
16 |
17 |
18 | FONTSIZE = 7
19 | plt.rc('legend',fontsize=FONTSIZE, title_fontsize=FONTSIZE)
20 | plt.rc('xtick',labelsize=FONTSIZE)
21 | plt.rc('ytick',labelsize=FONTSIZE)
22 | plt.rc("axes", labelsize=FONTSIZE)
23 |
24 |
25 | def get_auprc(df):
26 | try:
27 | assert 'score_1' in df.columns
28 | assert 'label' in df.columns
29 | except AssertionError:
30 | raise AssertionError('score_1 and label not in {}'.format(df.columns))
31 | preds = df['score_1'].tolist()
32 | is_hrd = df['label'].tolist()
33 |
34 | auprc = average_precision_score(y_true=is_hrd, y_score=preds)
35 | return {'auprc': auprc}
36 |
37 |
38 | def get_random_auprc_from_df(df):
39 | random_df = df.copy(deep=True)
40 | random_df['score_1'] = np.random.permutation(random_df['score_1'].values)
41 | return {'random_auprc': get_auprc(random_df)}
42 |
43 |
44 | def get_accuracy(df):
45 | return {'accuracy': accuracy_score(y_true=df.label, y_pred=df.predicted_class)}
46 |
47 |
48 | def get_all_single_class_auprc_values(df):
49 | d = {}
50 | for class_ in df.label.unique():
51 | class_ = int(class_)
52 | #print('class {}'.format(class_))
53 | is_truly_class = df['label'] == class_
54 | df.loc[is_truly_class, 'temp_binary_truth'] = 1
55 | df.loc[~is_truly_class, 'temp_binary_truth'] = 0
56 |
57 | df['temp_binary_pred'] = df['score_{}'.format(class_)]
58 | d['auprc_{}'.format(class_)] = average_precision_score(y_true=df['temp_binary_truth'],
59 | y_score=df['temp_binary_pred'])
60 | df.drop(columns=['temp_binary_pred', 'temp_binary_truth'], inplace=True)
61 | return d
62 |
63 |
64 | def get_confusion_matrix(df):
65 | raise NotImplementedError
66 | return {'confusion_matrix': confusion_matrix(y_true=df.label, y_pred=df.predicted_class)}
67 |
68 |
69 | def visualize(df, pred_file):
70 | sub_dir = os.path.join('visualizations', pred_file.split('/')[-1].replace('.csv', ''))
71 | if not os.path.exists(sub_dir):
72 | os.mkdir(sub_dir)
73 |
74 | if len(df) == 0:
75 | return
76 |
77 | for image_path, sub_df in df.groupby('image_path'):
78 | _visualize(image_path, sub_df, sub_dir)
79 |
80 |
81 | def _visualize(image_path, sub_df, sub_dir):
82 | desired_otsu_thumbnail_tile_size = 8 # 16
83 | scale_factor = config.args.tile_size / desired_otsu_thumbnail_tile_size
84 | print(image_path)
85 | try:
86 | slide = OpenSlide(image_path)
87 | except OpenSlideUnsupportedFormatError:
88 | print(image_path)
89 | exit()
90 | slide_mag = general_utils.get_magnification(slide)
91 | if slide_mag != config.args.magnification:
92 | if (slide_mag / config.args.magnification) == 2:
93 | scale = scale_factor * 2
94 | elif (slide_mag / config.args.magnification) == 4:
95 | scale = scale_factor * 4
96 | else:
97 | raise AssertionError('Invalid scale')
98 | else:
99 | scale = scale_factor
100 | thumbnail = general_utils.get_downscaled_thumbnail(slide, scale)
101 | sub_df['address'] = sub_df['tile_file_name'].apply(
102 | lambda x: [int(y) for y in x.replace('.png', '').split('/')[1].split('_')])
103 |
104 | # print(sub_df)
105 | thumbnail = general_utils.visualize_tile_scoring(thumbnail,
106 | desired_otsu_thumbnail_tile_size,
107 | sub_df.address.tolist(),
108 | sub_df.predicted_class.tolist(),
109 | overlap=int(config.args.overlap//scale_factor),
110 | range_=[0, 3])
111 | thumbnail = Image.fromarray(thumbnail)
112 | thumbnail = label_image_tissue_type(thumbnail, map_reverse_key)
113 | thumbnail = add_scale_bar(thumbnail, scale, slide_mag)
114 | thumbnail.save('{}/{}_{}_eval.png'.format(sub_dir, image_path.split('/')[-1], set_name))
115 |
116 |
117 | if __name__ == '__main__':
118 | all_preds = []
119 | all_acc = []
120 | for set_name, _pred_file in zip(['val'], [config.args.val_pred_file]):
121 | for fold in range(config.args.crossval):
122 | pred_file = _pred_file.format(fold)
123 | preds = pd.read_csv(pred_file)
124 | preds = preds.set_index('tile_file_name')
125 | n_classes = 4
126 |
127 | results = {}
128 |
129 | if n_classes == 2:
130 | results.update(get_auprc(preds))
131 | results.update(get_random_auprc_from_df(preds))
132 | print(preds)
133 | preds['predicted_class'] = preds.drop(columns='label').idxmax(axis='columns').str.replace(
134 | 'score_', '').astype(int)
135 | preds['certainty'] = preds.drop(columns=['label', 'predicted_class']).max(axis='columns')
136 |
137 | results.update(get_all_single_class_auprc_values(preds))
138 | results.update(get_accuracy(preds))
139 | all_acc.append(results['accuracy'])
140 |
141 | map_key = {0: 'Stroma', 1: 'Tumor', 2: 'Fat', 3: 'Necrosis'}
142 | map_reverse_key = dict([(v, k) for k, v in map_key.items()])
143 | all_preds.append(preds)
144 | cm_analysis(y_true=preds.label,
145 | y_pred=preds.predicted_class,
146 | ymap=map_key,
147 | labels=['Stroma', 'Tumor', 'Fat', 'Necrosis'],
148 | filename='evals/{}'.format(pred_file.split('/')[-1].replace('.csv', '_confusion.png')))
149 |
150 | with open('evals/{}'.format(pred_file.split('/')[-1].replace('.csv', '.txt')), 'w') as f:
151 | json.dump(results, f)
152 | preds = preds.reset_index()
153 | preds['image_id'] = preds['tile_file_name'].str.split('/').map(lambda x: str(x[0]))
154 | df = pd.read_csv(config.args.preprocessed_cohort_csv_path)[['image_path']]
155 | df['image_id'] = df['image_path'].str.split('/').map(lambda x: str(x[-1][:-4]))
156 | df = df.set_index('image_id')
157 | preds = preds.join(df, on='image_id', how='left').drop(columns=['image_id'])
158 | visualize(preds, pred_file)
159 |
160 | print('{:4.3f} +/- {:4.3f}'.format(np.mean(all_acc), np.std(all_acc)))
161 | all_preds = pd.concat(all_preds, axis=0)
162 | all_preds.to_csv('evals/all_preds_{}'.format(config.args.val_pred_file.split('/')[-1].format('_all')))
163 | cm_analysis(y_true=all_preds.label,
164 | y_pred=all_preds.predicted_class,
165 | ymap=map_key,
166 | labels=['Stroma', 'Tumor', 'Fat', 'Necrosis'],
167 | filename='evals/integrated.svg')
168 |
--------------------------------------------------------------------------------
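The per-class AUPRC in `get_all_single_class_auprc_values` is a one-vs-rest reduction: each class in turn becomes the positive label and is scored against its own softmax column. A toy sketch of the same computation:

import pandas as pd
from sklearn.metrics import average_precision_score

# Toy predictions: four tiles, two classes, one softmax column per class.
df = pd.DataFrame({
    'label':   [0, 0, 1, 1],
    'score_0': [0.9, 0.6, 0.3, 0.2],
    'score_1': [0.1, 0.4, 0.7, 0.8],
})

for class_ in sorted(df.label.unique()):
    ap = average_precision_score(y_true=(df.label == class_).astype(int),
                                 y_score=df['score_{}'.format(class_)])
    print('auprc_{}: {:.3f}'.format(class_, ap))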
/tissue-type-training/evals/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/evals/.gitkeep
--------------------------------------------------------------------------------
/tissue-type-training/general_utils.py:
--------------------------------------------------------------------------------
1 | import colorsys
2 | import re
3 |
4 | from openslide import OpenSlide, ImageSlide
5 | from openslide.deepzoom import DeepZoomGenerator
6 | from skimage.color import rgb2gray
7 | from skimage.filters import threshold_otsu
8 | from torch import distributed as dist
9 |
10 | import numpy as np
11 | import pandas as pd
12 | import torch
13 | import os
14 | from random import choice
15 | from PIL import Image, ImageDraw, ImageFont
16 | from skimage.draw import rectangle_perimeter, rectangle
17 | from skimage import color
18 | from copy import deepcopy
19 | from datetime import datetime
20 | from torchvision import transforms
21 | from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
22 |
23 |
24 | def get_magnification(slide):
25 | return int(slide.properties['aperio.AppMag'])
26 |
27 |
28 | def get_downscaled_thumbnail(slide, scale_factor=32):
29 | new_width = slide.dimensions[0] // scale_factor
30 | new_height = slide.dimensions[1] // scale_factor
31 | img = slide.get_thumbnail((new_width, new_height))
32 | return np.array(img)
33 |
34 |
35 | def get_full_resolution_generator(slide, tile_size, overlap=0, level_offset=0):
36 | assert isinstance(slide, OpenSlide) or isinstance(slide, ImageSlide)
37 | generator = DeepZoomGenerator(slide, overlap=overlap, tile_size=tile_size, limit_bounds=False)
38 | generator_level = generator.level_count - 1 - level_offset
39 | if level_offset == 0:
40 | assert generator.level_dimensions[generator_level] == slide.dimensions
41 | return generator, generator_level
42 |
43 |
44 | def adjust_scale_for_slide_mag(slide_mag, desired_mag, scale):
45 | if slide_mag != desired_mag:
46 | if slide_mag < desired_mag:
47 | raise AssertionError('expected mag >={} but got {}'.format(desired_mag, slide_mag))
48 | elif (slide_mag / desired_mag) == 2:
49 | scale *= 2
50 | elif (slide_mag / desired_mag) == 4:
51 | scale *= 4
52 | else:
53 | raise AssertionError('expected mag {}, {}, or {} but got {}'.format(desired_mag, 2 * desired_mag, 4 * desired_mag, slide_mag))
54 | return scale
55 |
56 |
57 | def visualize_tiling(_thumbnail, tile_size, tile_addresses, overlap=0):
58 | """
59 | Draw black boxes around tiles passing threshold
60 | :param _thumbnail: np.ndarray
61 | :param tile_size: int
62 | :param tile_addresses:
63 | :return: new thumbnail image with black boxes around tiles passing threshold
64 | """
65 | assert isinstance(_thumbnail, np.ndarray) and isinstance(tile_size, int)
66 | thumbnail = deepcopy(_thumbnail)
67 | generator, generator_level = get_full_resolution_generator(array_to_slide(thumbnail),
68 | tile_size=tile_size,
69 | overlap=overlap)
70 |
71 | for address in tile_addresses:
72 | if isinstance(address, list):
73 | address = address[0]
74 | extent = generator.get_tile_dimensions(generator_level, address)
75 | start = (address[1] * tile_size, address[0] * tile_size) # flip because OpenSlide uses
76 | # (column, row), but skimage
77 | # uses (row, column)
78 | rr, cc = rectangle_perimeter(start=start, extent=extent, shape=thumbnail.shape)
79 | thumbnail[rr, cc] = 1
80 |
81 | return thumbnail
82 |
83 |
84 | def colorize(image, hue, saturation=1.0):
85 | """ Add color of the given hue to an RGB image.
86 |
87 | By default, set the saturation to 1 so that the colors pop!
88 | """
89 | hsv = color.rgb2hsv(image)
90 | hsv[:, :, 1] = saturation
91 | hsv[:, :, 0] = hue
92 | rgb = (color.hsv2rgb(hsv) * 255).astype(int)
93 | return rgb
94 |
95 |
96 | def visualize_tile_scoring(_thumbnail, tile_size, tile_addresses, tile_scores, overlap=0, range_=[-3, 3]):
97 | """
98 | Colorize tiles according to their scores (hue encodes the score)
99 | :param _thumbnail: np.ndarray
100 | :param tile_size: int
101 | :param tile_addresses:
102 | :return: new thumbnail image with tiles colorized by their scores
103 | """
104 | denom = 2 * (range_[1] - range_[0])
105 | assert isinstance(_thumbnail, np.ndarray) and isinstance(tile_size, int)
106 | thumbnail = deepcopy(_thumbnail)
107 | generator, generator_level = get_full_resolution_generator(array_to_slide(thumbnail),
108 | tile_size=tile_size,
109 | overlap=overlap)
110 |
111 | for address, score in zip(tile_addresses, tile_scores):
112 | extent = generator.get_tile_dimensions(generator_level, address)
113 | start = (address[1] * tile_size, address[0] * tile_size) # flip because OpenSlide uses
114 | # (column, row), but skimage
115 | # uses (row, column)
116 |
117 | rr, cc = rectangle(start=start, extent=extent, shape=thumbnail.shape)
118 | thumbnail[rr, cc] = colorize(thumbnail[rr, cc], hue=0.5-(score-range_[0])/denom, saturation=0.5)
119 |
120 | return thumbnail
121 |
122 |
123 | def array_to_slide(arr):
124 | assert isinstance(arr, np.ndarray)
125 | slide = ImageSlide(Image.fromarray(arr))
126 | return slide
127 |
128 |
129 | def get_current_time():
130 | return str(datetime.now()).replace(' ', '_').split('.')[0].replace(':', '.')
131 |
132 |
133 | def load_preprocessed_df(file_name, min_n_tiles, cols=None, seed=None, explode=True):
134 | if cols:
135 | df_ = pd.read_csv(file_name, low_memory=False, usecols=cols)
136 | else:
137 | df_ = pd.read_csv(file_name, low_memory=False)
138 |
139 | df_ = df_[df_.n_foreground_tiles >= min_n_tiles]
140 |
141 | df_.tile_address = df_.tile_address.map(eval)
142 | if explode:
143 | df_ = df_.explode('tile_address')
144 | df_['tile_file_name'] = df_.tile_address.apply(
145 | lambda x: str(x).split(')')[0].replace('(', '').replace(')', '').replace('[', '').replace(']', '').replace(', ', '_') + '.png')
146 | # lambda x: get_tile_file_name('---', x.img_hid, x.tile_address),
147 | # axis=1).str.split('/').map(lambda x: x[-1])
148 |
149 | return df_
150 |
151 |
152 | def k_fold_ptwise_crossval(df, k, seed):
153 | if 'observed' in df.columns:
154 | if 'Patient ID' in df.columns:
155 | temp_df = df.groupby('Patient ID').agg('mean').observed
156 | else:
157 | temp_df = df.groupby('image_path').agg('mean').observed
158 |
159 | patient_ids = np.array(temp_df.index.tolist())
160 | observed = np.array(temp_df.tolist())
161 |
162 | kf = StratifiedKFold(n_splits=k, random_state=seed % (2**32 - 1), shuffle=True).split(patient_ids, observed)
163 | else:
164 | if 'Patient ID' in df.columns:
165 | patient_ids = np.sort(df['Patient ID'].unique())
166 | else:
167 | patient_ids = np.sort(df['image_path'].unique())
168 |
169 | kf = KFold(n_splits=k, random_state=seed % (2**32 - 1), shuffle=True).split(patient_ids)
170 |
171 | df_list = []
172 | for train_indices, test_indices in kf:
173 | train_labels = patient_ids[train_indices]
174 | test_labels = patient_ids[test_indices]
175 | # train_labels = list(set(DF.index[train_indices].tolist()))
176 | # test_labels = list(set(DF.index.tolist()) - set(train_labels))
177 | if 'Patient ID' in df.columns:
178 | train_mask = df['Patient ID'].isin(train_labels)
179 | test_mask = df['Patient ID'].isin(test_labels)
180 | else:
181 | train_mask = df['image_path'].isin(train_labels)
182 | test_mask = df['image_path'].isin(test_labels)
183 |
184 | assert test_mask.sum() > 0
185 | assert train_mask.sum() > 0
186 |
187 | DF = df.copy(deep=True)
188 | DF.loc[train_mask, 'split'] = 'train'
189 | DF.loc[test_mask, 'split'] = 'val'
190 | df_list.append(DF)
191 | return df_list
192 |
193 |
194 | def get_slide_dir(_dir, slide_file_name):
195 | slide_stem = slide_file_name[:-4]
196 | return os.path.join(_dir, slide_stem)
197 |
198 |
199 | def get_tile_file_name(_dir, slide_file_name, address):
200 | address_suffix = str(address).replace('(', '').replace(')', '').replace(', ', '_')
201 | file_name = address_suffix + '.png'
202 | return os.path.join(get_slide_dir(_dir, slide_file_name), file_name)
203 |
204 |
205 | def load_ddp_state_dict_to_device(path, device='cpu', ddp_to_serial=True):
206 | assert isinstance(device, str)
207 | ddp_state_dict = torch.load(path, map_location=device)
208 | if ddp_to_serial:
209 | state_dict = {}
210 | for key, value in ddp_state_dict.items():
211 | state_dict[key.replace('module.', '')] = value
212 | return state_dict
213 | else:
214 | return ddp_state_dict
215 |
216 |
217 | def setup(rank, world_size):
218 | # initialize the process group
219 | os.environ['MASTER_ADDR'] = 'localhost'
220 | try:
221 | os.environ['MASTER_PORT'] = '12355'
222 | dist.init_process_group("nccl", rank=rank, world_size=world_size)
223 | except RuntimeError:
224 | os.environ['MASTER_PORT'] = '1234'
225 | dist.init_process_group("nccl", rank=rank, world_size=world_size)
226 |
227 | print('device {} initialized'.format(rank))
228 |
229 |
230 | def cleanup():
231 | dist.destroy_process_group()
232 |
233 |
234 | def get_starting_timestamp(config):
235 | if config.args.checkpoint_path:
236 | short_path = config.args.checkpoint_path.split('/')[-1]
237 |         starting_timestamp_ = re.search(r'^.*(?=_epoch)', short_path).group(0)
238 | if '_fold' in starting_timestamp_:
239 | starting_timestamp_ = starting_timestamp_.split('_fold')[0]
240 | else:
241 | starting_timestamp_ = get_current_time()
242 | return starting_timestamp_
243 |
244 |
245 | def load_model_state_dict(model, checkpoint_path, device='cpu'):
246 | assert isinstance(device, str)
247 | model.load_state_dict(load_ddp_state_dict_to_device(
248 | os.path.join('checkpoints', checkpoint_path), device=device))
249 |
250 |
251 | def get_starting_epoch(config):
252 | if config.args.checkpoint_path:
253 |         starting_epoch = int(re.search(r'epoch(\d+)', config.args.checkpoint_path).group(1))
254 | else:
255 | starting_epoch = 1
256 | return starting_epoch
257 |
258 |
259 | def log_results_string(epoch, starting_epoch, train_loss, val_loss, starting_timestamp, fold, other_keys=dict()):
260 | epoch_str = get_epoch_str(epoch, starting_epoch)
261 | results_str = 'epoch {}: train loss {:.3e} | val loss {:.3e}'.format(
262 | epoch_str, train_loss, val_loss)
263 | if other_keys:
264 | other_keys = list(other_keys.items())
265 | other_keys.sort(key=lambda x: x[0])
266 | for key, val in other_keys:
267 | results_str += ' | {} {:.3e}'.format(key, val)
268 | print(results_str)
269 | with open('checkpoints/{}_fold{}_log.txt'.format(starting_timestamp, fold), 'a+') as file:
270 | file.write(results_str + '\n')
271 |
272 |
273 | def get_train_transforms(smaller_dim=None, normalize=True):
274 |     transform_list = []
275 |     if smaller_dim:
276 |         transform_list.append(transforms.RandomCrop(size=smaller_dim))
277 | 
278 |     transform_list.extend([
279 |         transforms.RandomHorizontalFlip(0.5),
280 |         transforms.RandomVerticalFlip(0.5),
281 |         transforms.ColorJitter(brightness=0.1,
282 |                                contrast=0.1,
283 |                                saturation=0.05,
284 |                                hue=0.01),
285 |         transforms.ToTensor()]
286 |     )
287 |     if normalize:
288 |         transform_list.append(transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)))
289 |     return transforms.Compose(transform_list)
290 |
291 |
292 | def get_val_transforms(smaller_dim=None, normalize=True):
293 |     transform_list = []
294 |     if smaller_dim:
295 |         transform_list.append(transforms.RandomCrop(size=smaller_dim))
296 |     transform_list.append(transforms.ToTensor())
297 |     if normalize:
298 |         transform_list.append(transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)))
299 |     return transforms.Compose(transform_list)
300 |
301 |
302 | def get_epoch_str(epoch, starting_epoch):
303 | return str(epoch + 1 + starting_epoch).zfill(3)
304 |
305 |
306 | def get_checkpoint_path(starting_timestamp, epoch, starting_epoch, fold):
307 | epoch_str = get_epoch_str(epoch, starting_epoch)
308 | return 'checkpoints/{}_fold{}_epoch{}.torch'.format(starting_timestamp, fold, epoch_str)
309 |
310 |
311 | def make_otsu(img, scale=1):
312 | """
313 | Make image with pixel-wise foreground/background labels.
314 |     :param img: RGB np.ndarray (converted to grayscale internally); scale: multiplier applied to the Otsu threshold
315 | :return: np.ndarray where each pixel is 0 if background and 1 if foreground
316 | """
317 | assert isinstance(img, np.ndarray)
318 | _img = rgb2gray(img)
319 | threshold = threshold_otsu(_img)
320 | return (_img < (threshold * scale)).astype(float)
321 |
322 |
323 | def label_image_tissue_type(thumbnail, map_key):
324 | """
325 |     Draw a color-coded legend of tissue classes on the thumbnail (hues match the score-based colorizing).
326 | """
327 | vals = list(map_key.values())
328 | colors = []
329 | range_ = [np.min(vals), np.max(vals)]
330 | denom = 2 * (range_[1] - range_[0])
331 | for class_, score in map_key.items():
332 | colors.append(tuple([int(255 * x) for x in
333 | colorsys.hsv_to_rgb(0.5 - (score - range_[0]) / denom, 0.5, 1.0)]))
334 | d = ImageDraw.Draw(thumbnail)
335 |
336 | text_locations = [(10, 10+40*x) for x in range(len(map_key))]
337 | for (class_, score), text_location in zip(map_key.items(), text_locations):
338 | d.text(text_location, class_, fill=colors[score])
339 | return thumbnail
340 |
341 |
342 | def get_fold_slides(df, world_size, rank):
343 | all_slides = df.image_id.unique()
344 | chunks = np.array_split(all_slides, world_size)
345 | return chunks[rank]
346 |
347 |
348 | def add_scale_bar(thumbnail, scale, slide_mag, len_in_um=1000):
349 | if slide_mag == 20:
350 | um_per_pix = 0.5
351 | elif slide_mag == 40:
352 | um_per_pix = 0.25
353 | else:
354 | raise RuntimeError("Unhandled slide mag {}x".format(slide_mag))
355 |
356 | um_per_pix *= scale
357 |
358 | len_in_pixels = len_in_um / float(um_per_pix)
359 |
360 | endpoints = [(10, 10+40*6), (10+len_in_pixels, 10+40*6)]
361 | d = ImageDraw.Draw(thumbnail)
362 | d.line(endpoints, fill='black', width=5)
363 | return thumbnail
364 |
--------------------------------------------------------------------------------
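A minimal usage sketch for the fold helper above (toy dataframe; assumes k_fold_ptwise_crossval is importable from general_utils): each returned dataframe is a full copy of the input with a 'split' column marking train/val rows for that fold, and all rows of a patient share one split.

    import pandas as pd
    from general_utils import k_fold_ptwise_crossval

    df = pd.DataFrame({'Patient ID': ['a', 'a', 'b', 'c', 'd', 'e'],
                       'tile': range(6)})
    for fold, fold_df in enumerate(k_fold_ptwise_crossval(df, k=3, seed=0)):
        val_patients = fold_df.loc[fold_df.split == 'val', 'Patient ID'].unique()
        print('fold {}: val patients {}'.format(fold, sorted(val_patients)))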
/tissue-type-training/models.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import torchvision.models
4 | from torchvision.models import resnet18, resnet34, resnet50, squeezenet1_1, vgg19_bn
5 |
6 |
7 | class TissueTileNet(nn.Module):
8 | def __init__(self, model, n_classes, activation=None):
9 | super(TissueTileNet, self).__init__()
10 | if type(model) in [torchvision.models.resnet.ResNet]:
11 | model.fc = nn.Linear(512, n_classes)
12 | elif type(model) == torchvision.models.squeezenet.SqueezeNet:
13 | list(model.children())[1][1] = nn.Conv2d(512, n_classes, kernel_size=1, stride=1)
14 | else:
15 | raise NotImplementedError
16 | self.model = model
17 | self.activation = activation
18 |
19 | def forward(self, x):
20 | y = self.model(x)
21 | if self.activation:
22 | y = self.activation(y)
23 |
24 | return y
25 |
26 |
27 | def get_model(cf):
28 | if cf.args.model == 'resnet18':
29 | return resnet18(pretrained=True)
30 | elif cf.args.model == 'resnet34':
31 | return resnet34(pretrained=True)
32 | elif cf.args.model == 'resnet50':
33 | return resnet50(pretrained=True)
34 | elif cf.args.model == 'squeezenet':
35 | return squeezenet1_1(pretrained=True)
36 | elif cf.args.model == 'vgg19':
37 | return vgg19_bn(pretrained=True)
38 | elif cf.args.model == 'tissue-type':
39 | model = load_tissue_tile_net()
40 | model = model.model
41 | for idx, child in enumerate(model.children()):
42 | if idx in [0, 1, 2, 3, 4, 5, 6, 7]: # 7 is last res block, 9 is fc layer
43 | for param in child.parameters():
44 | param.requires_grad = False
45 | return model
46 | else:
47 |         raise RuntimeError("Model type {} unknown".format(cf.args.model))
48 |
49 | def load_tissue_tile_net(checkpoint_path='', activation=None, n_classes=4):
50 | model = TissueTileNet(resnet18(), n_classes, activation=activation)
51 | model.load_state_dict(torch.load(
52 | checkpoint_path,
53 | map_location='cpu'))
54 | return model
55 |
--------------------------------------------------------------------------------
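A minimal sketch of how the pieces above fit together, mirroring train_tissue_tile_clf.py: wrap an ImageNet-pretrained resnet18 in TissueTileNet (which swaps the 512-unit fc layer for an n_classes head) and push a dummy batch of tiles through it.

    import torch
    from torchvision.models import resnet18
    from models import TissueTileNet

    model = TissueTileNet(resnet18(pretrained=True), n_classes=4,
                          activation=torch.nn.Softmax(dim=1))
    tiles = torch.randn(2, 3, 64, 64)  # two RGB tiles; 64 px matches train_on_all_annotations.sh
    with torch.no_grad():
        probs = model(tiles)
    print(probs.shape)  # torch.Size([2, 4]); each row sums to 1 because of the softmax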
/tissue-type-training/pred_tissue_tile.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import os
3 | import re
4 |
5 | from torch.utils.data import DataLoader
6 |
7 | import config
8 | import general_utils
9 | from dataset import TissueTileDataset
10 | from train_tissue_tile_clf import make_preds, prep_df
11 | from models import TissueTileNet, get_model
12 |
13 | if __name__ == '__main__':
14 | assert config.args.checkpoint_path
15 | assert len(config.args.gpu) == 1
16 |
17 | device_str = 'cuda:{}'.format(config.args.gpu[0])
18 | device = torch.device(device_str)
19 |
20 | num_workers = 8
21 | transforms = general_utils.get_val_transforms()
22 | seed = 1123011750
23 |
24 | df_, n_classes, map_key, map_reverse_key = prep_df(config.args.preprocessed_cohort_csv_path,
25 | tile_dir=config.args.tile_dir)
26 | fold = int(config.args.checkpoint_path.split('fold')[1][0])
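    |     # single-character parse: assumes a single-digit fold index in the checkpoint file name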
27 | df = general_utils.k_fold_ptwise_crossval(df_, config.args.crossval, seed)[fold]
28 |
29 | model = TissueTileNet(model=get_model(config),
30 | n_classes=n_classes,
31 | activation=torch.nn.Softmax(dim=1))
32 | model.load_state_dict(general_utils.load_ddp_state_dict_to_device(config.args.checkpoint_path))
33 | model.to(device)
34 |
36 | print('making val preds for fold {}'.format(fold))
37 | val_dataset = TissueTileDataset(df=df[df.split == 'val'],
38 | tile_dir=config.args.tile_dir,
39 | transforms=transforms)
40 | assert len(val_dataset) > 0
41 | val_loader = DataLoader(val_dataset,
42 | batch_size=config.args.batch_size,
43 | num_workers=num_workers)
44 | make_preds(model,
45 | val_loader,
46 | device,
47 | config.args.val_pred_file,
48 | n_classes)
49 |
--------------------------------------------------------------------------------
/tissue-type-training/predictions/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/predictions/.gitkeep
--------------------------------------------------------------------------------
/tissue-type-training/preprocess.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import yaml
4 | import pandas as pd
5 | import numpy as np
6 |
7 | from openslide import OpenSlide
8 | from openslide.lowlevel import OpenSlideUnsupportedFormatError
9 | from shapely.geometry import Polygon, Point
10 | from PIL import Image
11 | from joblib import Parallel, delayed
12 | from skimage.color import rgb2lab
13 |
14 | import config
15 | import general_utils
16 | import itertools
17 |
18 | from general_utils import make_otsu
19 |
20 |
21 | def percent_otsu_score(tile):
22 | """
23 | Get percent foreground score.
24 | :param tile: PIL.Image (greyscale)
25 |     :return: float [0,1] of percent foreground in tile
26 | """
27 | assert isinstance(tile, Image.Image)
28 | arr = np.array(tile)
29 | return np.mean(arr)
30 |
31 |
32 | def purple_score(tile_):
33 | """
34 | Get percent purple score.
35 | :param tile_: PIL.Image (RGB)
36 |     :return: float [0, 1]: purple-pixel count over the total element count (H*W*C)
37 |     """
38 |     assert isinstance(tile_, Image.Image)
39 |     tile = np.array(tile_).astype(int)  # promote from uint8 so g + 10 cannot wrap around
40 |     r, g, b = tile[..., 0], tile[..., 1], tile[..., 2]
41 |     score = np.sum((r > (g + 10)) & (b > (g + 10)))
42 |     return score / tile.size
43 |
44 |
45 | def score_tiles(otsu_img, rgb_img, tile_size):
46 | """
47 | Get scores for tiles based on percent foreground. When tile_size and img size are downscaled
48 | proportionally, these coordinates map directly into the slide with proportionately upscaled
49 | tile_size and img size.
50 | :param otsu_img: np.ndarray, possibly downsampled. binary thresholded
51 | :param rgb_img: np.ndarray, possibly downsampled. RGB. same size as otsu_img
52 | :param tile_size: side length
53 | :return: list of (int_x, int_y) tuples
54 | """
55 | assert isinstance(otsu_img, np.ndarray) and isinstance(tile_size, int)
56 | otsu_slide = general_utils.array_to_slide(otsu_img)
57 | otsu_generator, otsu_generator_level = general_utils.get_full_resolution_generator(otsu_slide,
58 | tile_size=tile_size)
59 | rgb_slide = general_utils.array_to_slide(rgb_img)
60 | rgb_generator, rgb_generator_level = general_utils.get_full_resolution_generator(rgb_slide,
61 | tile_size=tile_size)
62 |
63 | tile_x_count, tile_y_count = otsu_generator.level_tiles[otsu_generator_level]
64 | address_list = []
65 | for count, address in enumerate(itertools.product(range(tile_x_count), range(tile_y_count))):
66 | if count > 1000 and len(address_list) > 10 and PROTOTYPE:
67 | break
68 | dimensions = otsu_generator.get_tile_dimensions(otsu_generator_level, address)
69 | if not (dimensions[0] == tile_size) or not (dimensions[1] == tile_size):
70 | continue
71 |
72 | rgb_tile = rgb_generator.get_tile(rgb_generator_level, address)
73 | if not purple_score(rgb_tile) > config.args.purple_threshold:
74 | continue
75 |
76 | otsu_tile = otsu_generator.get_tile(otsu_generator_level, address)
77 | otsu_score = percent_otsu_score(otsu_tile)
78 |
79 | if otsu_score < config.args.otsu_threshold:
80 | continue
81 |
82 | address_list.append(address)
83 |
84 | return address_list
85 |
86 |
87 | def score_tiles_manual(ann_file_name, otsu_img, thumbnail, tile_size, overlap):
88 | assert isinstance(thumbnail, np.ndarray) and isinstance(tile_size, int)
89 |
90 | try:
91 | with open(ann_file_name, 'r') as f:
92 | slide_annotations = json.load(f)['features']
93 | except FileNotFoundError:
94 | print('Warning: {} not found'.format(ann_file_name))
95 | return []
96 |
97 | slide = general_utils.array_to_slide(thumbnail)
98 | generator, generator_level = general_utils.get_full_resolution_generator(slide,
99 | tile_size=tile_size,
100 | overlap=overlap)
101 | otsu_slide = general_utils.array_to_slide(otsu_img)
102 | otsu_generator, otsu_generator_level = general_utils.get_full_resolution_generator(otsu_slide,
103 | tile_size=tile_size,
104 | overlap=overlap)
105 | tile_x_count, tile_y_count = generator.level_tiles[generator_level]
106 | print('{}, {}'.format(tile_x_count, tile_y_count))
107 | address_list = []
108 | for count, address in enumerate(itertools.product(range(tile_x_count), range(tile_y_count))):
109 | if count > 1000 and len(address_list) > 10 and PROTOTYPE:
110 | break
111 | dimensions = generator.get_tile_dimensions(generator_level, address)
112 | assert isinstance(tile_size, int)
113 | if dimensions[0] != (tile_size + 2*overlap) or dimensions[1] != (tile_size + 2*overlap):
114 | continue
115 |
116 | tile_location, _level_, new_tile_size = generator.get_tile_coordinates(generator_level, address)
117 | assert _level_ == 0
118 | tile_class = is_tile_in_annotations(tile_location, new_tile_size, slide_annotations)
119 | if not tile_class:
120 | continue
121 | else:
122 | if int(tile_class) in [3, 4, 6, 7]:
123 | otsu_tile = otsu_generator.get_tile(otsu_generator_level, address)
124 | otsu_score = percent_otsu_score(otsu_tile)
125 |
126 | if otsu_score < config.args.otsu_threshold:
127 | continue
128 | address_list.append([address, tile_class])
129 | return address_list
130 |
131 |
132 | def is_tile_in_annotations(tile_location, tile_size, slide_annotations):
133 | """
134 | Determine whether tile is in annotations, and if so, what class of annotation.
135 | :param tile_location:
136 | :param tile_size:
137 | :param slide_annotations:
138 |     :return: 0 if tile is not in any annotation, else the label_num of the first region containing all four tile corners
139 | """
140 | points = [Point(tile_location[0], tile_location[1]),
141 | Point(tile_location[0] + tile_size[0], tile_location[1]),
142 | Point(tile_location[0], tile_location[1] + tile_size[1]),
143 | Point(tile_location[0] + tile_size[0], tile_location[1] + tile_size[1])]
144 | for annotation_ in slide_annotations:
145 | point_count = 0
146 | class_ = annotation_['properties']['label_num']
147 | assert annotation_['geometry']['type'] == 'Polygon'
148 |
149 | coords = annotation_['geometry']['coordinates']
150 | if len(coords) >= 3:
151 | annotation = Polygon(coords)
152 | for point in points:
153 | if annotation.contains(point):
154 | point_count += 1
155 | if point_count > 3:
156 | return class_
157 | return 0
158 |
159 |
160 | def get_slide_tile_addresses(image_path, mag, scale, desired_tile_selection_size, index, tile_selection, overlap, annotation_path=None, visualize=False):
161 | assert isinstance(overlap, int)
162 |
163 | slide = OpenSlide(image_path)
164 |
165 | slide_mag = general_utils.get_magnification(slide)
166 | scale = general_utils.adjust_scale_for_slide_mag(slide_mag=slide_mag, desired_mag=mag, scale=scale)
167 | thumbnail = general_utils.get_downscaled_thumbnail(slide, scale)
168 | overlap = int(overlap // scale)
169 | otsu_thumbnail = make_otsu(thumbnail)
170 | if tile_selection == 'otsu':
171 | assert config.args.overlap == 0
172 | tile_addresses = score_tiles(otsu_thumbnail,
173 | thumbnail,
174 | tile_size=desired_tile_selection_size)
175 | tile_addresses = [[x, -1] for x in tile_addresses] # for consistent formatting with manual
176 | elif tile_selection == 'manual':
177 | assert annotation_path is not None
178 | tile_addresses = score_tiles_manual(annotation_path,
179 | otsu_thumbnail,
180 | thumbnail,
181 | tile_size=desired_tile_selection_size,
182 | overlap=overlap)
183 | else:
184 | raise RuntimeError
185 |
186 | if visualize:
187 | thumbnail = general_utils.visualize_tiling(thumbnail,
188 | desired_tile_selection_size,
189 | tile_addresses,
190 | overlap=overlap)
191 | thumbnail = Image.fromarray(thumbnail)
192 | thumbnail.save('tiling_visualizations/{}_tiling.png'.format(image_path.split('/')[-1]))
193 | return {index: tile_addresses}
194 |
195 |
196 | if __name__ == '__main__':
197 | PROTOTYPE = False
198 | visualize = False
199 | serial = False
200 |
201 | scale_factor = config.args.tile_size / config.desired_otsu_thumbnail_tile_size
202 |
203 | df = pd.read_csv(config.args.cohort_csv_path)
204 | with open('../global_config.yaml', 'r') as f:
205 | DIRECTORIES = yaml.safe_load(f)
206 | DATA_DIR = DIRECTORIES['data_dir']
207 | df['image_path'] = df['image_path'].apply(lambda x: os.path.join(DATA_DIR, x))
208 | if 'segmentation_path' in df.columns:
209 | df['segmentation_path'] = df['segmentation_path'].apply(lambda x: os.path.join(DATA_DIR, x))
210 |
211 | if PROTOTYPE:
212 | df = df.head(4)
213 |
214 | coords = {}
215 | if serial:
216 | for index, row in df.iterrows():
217 | print(row['image_path'])
218 | tile_addresses = get_slide_tile_addresses(row['image_path'],
219 | mag=config.args.magnification,
220 | scale=scale_factor,
221 | desired_tile_selection_size=
222 | config.desired_otsu_thumbnail_tile_size,
223 | index=index,
224 | tile_selection=config.args.tile_selection_type,
225 | overlap=config.args.overlap,
226 | visualize=visualize,
227 | annotation_path=row['segmentation_path'] if 'segmentation_path' in df.columns else None)
228 | coords.update(tile_addresses)
229 | else:
230 | _dicts = Parallel(n_jobs=64)(delayed(get_slide_tile_addresses)(row['image_path'],
231 | mag=config.args.magnification,
232 | scale=scale_factor,
233 | desired_tile_selection_size=
234 | config.desired_otsu_thumbnail_tile_size,
235 | index=index,
236 | tile_selection=config.args.tile_selection_type,
237 | overlap=config.args.overlap,
238 | visualize=visualize,
239 | annotation_path=row['segmentation_path'] if 'segmentation_path' in df.columns else None)
240 | for index, row in df.iterrows())
241 | for _dict in _dicts:
242 | coords.update(_dict)
243 |
244 | df['tile_address'] = pd.Series(coords)
245 | df['n_foreground_tiles'] = df['tile_address'].map(len)
246 |
247 | df.to_csv(config.args.preprocessed_cohort_csv_path, index=False)
248 |
--------------------------------------------------------------------------------
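The annotation JSON consumed by score_tiles_manual is GeoJSON-like; the shape below is reconstructed from how the parsing code reads it (a 'features' list whose entries carry properties.label_num and a flat list of (x, y) polygon vertices), with illustrative values. Assuming is_tile_in_annotations is in scope (importing preprocess also imports config, which parses CLI arguments):

    slide_annotations = [
        {'properties': {'label_num': 3},  # 3 and 4 map to 'Tumor' in train_tissue_tile_clf.py
         'geometry': {'type': 'Polygon',
                      'coordinates': [[0, 0], [5000, 0], [5000, 5000], [0, 5000]]}},
    ]

    # a tile whose four corners all fall inside the polygon gets that polygon's class
    print(is_tile_in_annotations((100, 100), (512, 512), slide_annotations))  # -> 3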
/tissue-type-training/pretile.py:
--------------------------------------------------------------------------------
1 | import general_utils
2 | import config
3 | import os
4 | import numpy as np
5 | import yaml
6 |
7 | from PIL import Image
8 | from openslide import OpenSlide
9 | from openslide.lowlevel import OpenSlideUnsupportedFormatError
10 | from openslide.deepzoom import DeepZoomGenerator
11 | from joblib import delayed, Parallel
12 | from skimage.color import rgb2lab, lab2rgb
13 |
14 | from general_utils import make_otsu
15 |
16 |
17 | # normalization tools from https://github.com/CODAIT/deep-histopath/blob/master/deephistopath/preprocessing.py
18 | stain_ref = np.array([[0.56237296, 0.38036293],
19 | [0.72830425, 0.83254214],
20 | [0.39154767, 0.40273766]])
21 | max_sat_ref = np.array([[0.62245465],
22 | [0.44427557]])
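   | # stain_ref's columns are reference optical-density stain vectors (hematoxylin, eosin);
   | # max_sat_ref holds the matching reference 99th-percentile stain saturations, used by
   | # apply_macenko_transform to rescale each slide's saturations onto this reference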
23 |
24 | beta = 0.15
25 | alpha = 1
26 | light_intensity = 255
27 |
28 |
29 | # credit: StainTools (https://github.com/Peter554/StainTools/blob/master/staintools/preprocessing/luminosity_standardizer.py)
30 | def get_standard_luminosity_limit(rgb):
31 | assert isinstance(rgb, np.ndarray)
32 | lab = rgb2lab(rgb)
33 | p = np.percentile(lab[:, :, 0], 95)
34 | return p
35 |
36 |
37 | # credit: StainTools (https://github.com/Peter554/StainTools/blob/master/staintools/preprocessing/luminosity_standardizer.py)
38 | def apply_standard_luminosity_limit(rgb, p):
39 | assert isinstance(rgb, np.ndarray)
40 | lab = rgb2lab(rgb)
41 | lab[:, :, 0] = np.clip(100 * lab[:, :, 0] / p, 0, 100)
42 | return np.round(np.clip(255 * lab2rgb(lab), 0, 255)).astype(np.uint8)
43 |
44 | # credit: StainTools (https://github.com/Peter554/StainTools/blob/master/staintools/preprocessing/luminosity_standardizer.py)
45 | def calculate_macenko_transform(to_transform):
46 | assert isinstance(to_transform, np.ndarray)
47 |
48 | c = to_transform.shape[2]
49 | assert c == 3
50 |
51 | luminosity_limit = get_standard_luminosity_limit(to_transform)
52 | to_transform = apply_standard_luminosity_limit(to_transform, luminosity_limit)
53 |
54 | im = rgb2lab(to_transform)
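   |     # mask out non-tissue pixels before fitting stains: dark low-chroma, strongly blue, near-white, and strongly red regions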
55 |     ignore_mask = ((im[:, :, 0] < 50) & (np.abs(im[:, :, 1] - im[:, :, 2]) < 30)) | (im[:, :, 2] < -40) | (im[:, :, 0] > 90) | (im[:, :, 1] > 40)
56 | to_transform = to_transform[~ignore_mask, :]
57 |
58 | to_transform = to_transform.reshape(-1, c).astype(np.float64) # shape (H*W, C)
59 |
60 | # Convert RGB to OD.
61 | OD = -np.log10(to_transform/light_intensity + 1e-8)
62 |
63 | # Remove data with OD intensity less than beta.
64 | OD_thresh = OD[np.all(OD >= beta, 1), :]
65 |
66 | # Calculate eigenvectors.
67 | U, s, V = np.linalg.svd(OD_thresh, full_matrices=False)
68 |
69 | # Extract two largest eigenvectors.
70 | top_eigvecs = V[0:2, :].T * -1 # shape (C, 2)
71 |
72 | # Project thresholded optical density values onto plane spanned by
73 | # 2 largest eigenvectors.
74 | proj = np.dot(OD_thresh, top_eigvecs) # shape (K, 2)
75 |
76 | # Calculate angle of each point wrt the first plane direction.
77 | # Note: the parameters are `np.arctan2(y, x)`
78 | angles = np.arctan2(proj[:, 1], proj[:, 0]) # shape (K,)
79 |
80 | # Find robust extremes (a and 100-a percentiles) of the angle.
81 | min_angle = np.percentile(angles, alpha)
82 | max_angle = np.percentile(angles, 100 - alpha)
83 |
84 | # Convert min/max vectors (extremes) back to optimal stains in OD space.
85 | # This computes a set of axes for each angle onto which we can project
86 | # the top eigenvectors. This assumes that the projected values have
87 | # been normalized to unit length.
88 | extreme_angles = np.array(
89 | [[np.cos(min_angle), np.cos(max_angle)],
90 | [np.sin(min_angle), np.sin(max_angle)]]
91 | ) # shape (2,2)
92 | stains = np.dot(top_eigvecs, extreme_angles) # shape (C, 2)
93 |
94 | # Merge vectors with hematoxylin first, and eosin second, as a heuristic.
95 | if stains[0, 0] < stains[0, 1]:
96 | stains[:, [0, 1]] = stains[:, [1, 0]] # swap columns
97 |
98 | # Calculate saturations of each stain.
99 | # Note: Here, we solve
100 | # OD = VS
101 | # S = V^{-1}OD
102 | # where `OD` is the matrix of optical density values of our image,
103 | # `V` is the matrix of stain vectors, and `S` is the matrix of stain
104 | # saturations. Since this is an overdetermined system, we use the
105 | # least squares solver, rather than a direct solve.
106 | sats, _, _, _ = np.linalg.lstsq(stains, OD.T, rcond=None)
107 |
108 | # Normalize stain saturations to have same pseudo-maximum based on
109 | # a reference max saturation.
110 | max_sat = np.percentile(sats, 99, axis=1, keepdims=True)
111 | return stains, max_sat, luminosity_limit
112 |
113 |
114 | # credit: StainTools (https://github.com/Peter554/StainTools/blob/master/staintools/preprocessing/luminosity_standardizer.py)
115 | def apply_macenko_transform(stains, max_sat, luminosity_limit, to_transform):
116 | assert isinstance(to_transform, np.ndarray)
117 |
118 | h, w, c = to_transform.shape
119 | assert c == 3
120 |
121 | to_transform = apply_standard_luminosity_limit(to_transform, luminosity_limit)
122 |
123 | to_transform = to_transform.reshape(-1, c).astype(np.float64) # shape (H*W, C)
124 |
125 | # Convert RGB to OD.
126 | OD = -np.log10(to_transform/light_intensity + 1e-8)
127 |
128 | # Calculate saturations of each stain.
129 | # Note: Here, we solve
130 | # OD = VS
131 | # S = V^{-1}OD
132 | # where `OD` is the matrix of optical density values of our image,
133 | # `V` is the matrix of stain vectors, and `S` is the matrix of stain
134 | # saturations. Since this is an overdetermined system, we use the
135 | # least squares solver, rather than a direct solve.
136 | sats, _, _, _ = np.linalg.lstsq(stains, OD.T, rcond=None)
137 |
138 | # Normalize stain saturations to have same pseudo-maximum based on
139 | # a reference max saturation.
140 | sats = sats / max_sat * max_sat_ref
141 |
142 | # Compute optimal OD values.
143 | OD_norm = np.dot(stain_ref, sats)
144 |
145 | # Recreate image.
146 | # Note: If the image is immediately converted to uint8 with `.astype(np.uint8)`, it will
147 | # not return the correct values due to the initial values being outside of [0,255].
148 | # To fix this, we round to the nearest integer, and then clip to [0,255], which is the
149 | # same behavior as Matlab.
150 | # x_norm = np.exp(OD_norm) * light_intensity # natural log approach
151 | x_norm = 10 ** (-OD_norm) * light_intensity - 1e-8 # log10 approach
152 | x_norm = np.clip(np.round(x_norm), 0, 255).astype(np.uint8)
154 | x_norm = x_norm.T.reshape(h, w, c)
155 | return x_norm
156 |
157 |
158 | def pretile_slide(row, tile_size, tile_dir, normalize=False, overlap=0):
159 | slide_dir = os.path.join(tile_dir, row['image_path'].split('/')[-1].replace('.svs', ''))
160 | if os.path.exists(slide_dir):
161 | n_tiles_saved = len(os.listdir(slide_dir))
162 | if n_tiles_saved == row.n_foreground_tiles:
163 | print("{} fully tiled; skipping".format(slide_dir))
164 | return
165 | else:
166 | os.mkdir(slide_dir)
167 |
168 | slide = OpenSlide(row['image_path'])
169 |
170 | slide_mag = general_utils.get_magnification(slide)
171 | if slide_mag == config.args.magnification:
172 | level_offset = 0
173 | elif slide_mag == 2 * config.args.magnification:
174 | level_offset = 1
175 | elif slide_mag == 4 * config.args.magnification:
176 | level_offset = 2
177 | else:
178 | raise NotImplementedError
179 |
180 | if normalize:
181 | size0, size1 = slide.dimensions
182 | stains, max_sat, luminosity_limit = calculate_macenko_transform(
183 | np.array(slide.get_thumbnail((size0//16, size1//16))))
184 | else:
185 | stains, max_sat, luminosity_limit = None, None, None
186 |
187 | addresses = row.tile_address
188 | generator, level = general_utils.get_full_resolution_generator(slide, tile_size=tile_size,
189 | level_offset=level_offset,
190 | overlap=overlap)
191 | for address, class_ in addresses:
192 | tile_file_name = os.path.join(slide_dir,
193 | str(address).replace('(', '').replace(')', '').replace(', ', '_') + '.png')
194 | if os.path.exists(tile_file_name):
195 | continue
196 | tile = generator.get_tile(level, address)
197 |
198 | if normalize:
199 | tile = apply_macenko_transform(stains, max_sat, luminosity_limit, np.array(tile))
200 | tile = Image.fromarray(tile)
201 | tile.save(tile_file_name)
202 |
203 |
204 | if __name__ == '__main__':
205 | serial = True
206 | df = general_utils.load_preprocessed_df(file_name=config.args.preprocessed_cohort_csv_path,
207 | min_n_tiles=1,
208 | explode=False)
209 |
210 | with open('../global_config.yaml', 'r') as f:
211 | DIRECTORIES = yaml.safe_load(f)
212 | DATA_DIR = DIRECTORIES['data_dir']
213 | df['image_path'] = df['image_path'].apply(lambda x: os.path.join(DATA_DIR, x))
214 |
215 | if serial:
216 | for _, row in df.iterrows():
217 | pretile_slide(row=row,
218 | tile_size=config.args.tile_size,
219 | tile_dir=config.args.tile_dir,
220 | normalize=config.args.normalize,
221 | overlap=config.args.overlap)
222 | else:
223 | Parallel(n_jobs=64)(delayed(pretile_slide)(row=row,
224 | tile_size=config.args.tile_size,
225 | tile_dir=config.args.tile_dir,
226 | normalize=config.args.normalize,
227 | overlap=config.args.overlap)
228 | for _, row in df.iterrows())
229 |
--------------------------------------------------------------------------------
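The optical-density conversion at the core of the Macenko code above is easy to sanity-check in isolation; a small sketch, independent of any slide I/O:

    # OD = -log10(I / I0 + eps), inverted as I = 10**(-OD) * I0 - eps, with I0 = light_intensity
    import numpy as np

    light_intensity = 255
    pixels = np.array([[200.0, 120.0, 180.0]])        # one RGB pixel, shape (1, 3)
    OD = -np.log10(pixels / light_intensity + 1e-8)   # darker (more stained) pixels -> larger OD
    recovered = 10 ** (-OD) * light_intensity - 1e-8  # same inversion used in apply_macenko_transform
    print(np.round(recovered))                        # [[200. 120. 180.]]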
/tissue-type-training/pretilings/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/pretilings/.gitkeep
--------------------------------------------------------------------------------
/tissue-type-training/train_on_all_annotations.sh:
--------------------------------------------------------------------------------
1 | ARGS="--magnification 20
2 | --tile_dir pretilings
3 | --cohort_csv_path data/dataframes/tissuetype_hne_df.csv
4 | --preprocessed_cohort_csv_path data/dataframes/preprocessed_tissuetype_hne_df.csv
5 | --tile_selection_type manual
6 | --otsu_threshold 0.2
7 | --batch_size 96
8 | --min_n_tiles 1
9 | --num_epochs 22
10 | --crossval 0
11 | --gpu 0
12 | --overlap 32
13 | --tile_size 64
14 | --normalize
15 | --model resnet18
16 | --experiment_name fulldata
17 | --learning_rate 0.0005"
18 |
19 | python train_tissue_tile_clf.py ${ARGS}
20 |
--------------------------------------------------------------------------------
/tissue-type-training/train_tissue_tile_clf.py:
--------------------------------------------------------------------------------
1 | from torch.optim import Adam
2 | from torch.utils.data import DataLoader
3 | from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
4 | from sklearn.utils import compute_class_weight
5 |
6 | import numpy as np
7 | import csv
8 | import pickle
9 | import config
10 | import general_utils
11 | import torch
12 | import re
13 | import pandas as pd
14 | import os
15 |
16 | from dataset import TissueTileDataset
17 | from general_utils import get_checkpoint_path, get_train_transforms, get_val_transforms, \
18 | log_results_string
19 | from models import TissueTileNet, get_model
20 |
21 |
22 | def train_epoch(model, train_loader, optimizer, device, criterion):
23 | model.train()
24 | total_loss = 0
25 | y_pred = []
26 | y_true = []
27 | n = len(train_loader.dataset)
28 | for idx, tiles, labels in train_loader:
29 | # forward
30 | optimizer.zero_grad()
31 | output = model(tiles.to(device))
32 |
33 |         # compute loss and backpropagate
34 | loss = criterion(input=output, target=labels.long().to(device))
35 | loss.backward()
36 |
37 |         # update parameters
38 | optimizer.step()
39 | total_loss += loss.detach().cpu().item()
40 |
41 | # keep track of true and predicted classes
42 | y_pred.extend(output.argmax(1).detach().cpu().reshape(-1).numpy())
43 | y_true.extend(labels.reshape(-1).numpy())
44 |
45 | total_loss /= n
46 | acc = accuracy_score(y_true=y_true, y_pred=y_pred)
47 | f1 = f1_score(y_true=y_true, y_pred=y_pred, average='macro')
48 | confusion = confusion_matrix(y_true=y_true, y_pred=y_pred)
49 | return total_loss, acc, f1, confusion
50 |
51 |
52 | def validate_epoch(model, val_loader, device, criterion):
53 | model.eval()
54 | total_loss = 0
55 | y_pred = []
56 | y_true = []
57 | n = len(val_loader.dataset)
58 | with torch.no_grad():
59 | for idx, tiles, labels in val_loader:
60 | # forward
61 | output = model(tiles.to(device))
62 |
63 | # calculate loss
64 | loss = criterion(input=output, target=labels.long().to(device))
65 | total_loss += loss.detach().cpu().item()
66 |
67 | # keep track of true and predicted classes
68 | y_pred.extend(output.argmax(1).detach().cpu().reshape(-1).numpy())
69 | y_true.extend(labels.reshape(-1).numpy())
70 |
71 | total_loss /= n
72 | acc = accuracy_score(y_true=y_true, y_pred=y_pred)
73 | f1 = f1_score(y_true=y_true, y_pred=y_pred, average='macro')
74 | confusion = confusion_matrix(y_true=y_true, y_pred=y_pred)
75 | return total_loss, acc, f1, confusion
76 |
77 |
78 | def make_preds(model, loader, device, file_name, n_classes):
79 | header = ['tile_file_name', 'label']
80 | header.extend(['score_{}'.format(k) for k in range(n_classes)])
81 | with open(file_name, 'w', newline='') as file:
82 | writer = csv.writer(file, delimiter=',')
83 | writer.writerow(header)
84 |
85 | model.eval()
86 | with torch.no_grad():
87 | for ids, tiles, labels in loader:
88 | preds = model(tiles.to(device))
89 | preds = preds.detach().cpu().tolist()
90 | for idx, label, pred_list in zip(ids, labels.tolist(), preds):
91 | row = [idx, label]
92 | row.extend(pred_list)
93 | writer.writerow(row)
94 |
95 |
96 | def serialize(device_id, df, experiment_name, fold):
97 | starting_epoch = 0
98 | device = torch.device('cuda:{}'.format(device_id))
99 | train_df = df[df.split == 'train'].copy()
100 |     assert not (train_df.split == 'val').any()  # 'in' on a Series checks the index, so compare values
101 | val_df = df[df.split == 'val'].copy()
102 |     assert not (val_df.split == 'train').any()
103 |
104 | print('train ({}):'.format(len(train_df)))
105 | print(train_df.tile_class.value_counts())
106 | print('val ({}):'.format(len(val_df)))
107 | print(val_df.tile_class.value_counts())
108 | do_validation = len(val_df) > 0
109 |
110 | train_dataset = TissueTileDataset(df=train_df,
111 | tile_dir=config.args.tile_dir,
112 | transforms=get_train_transforms(normalize=True))
113 | train_loader = DataLoader(train_dataset,
114 | batch_size=config.args.batch_size,
115 | num_workers=8,
116 | pin_memory=False,
117 | shuffle=True)
118 |
119 | if do_validation:
120 | val_dataset = TissueTileDataset(df=val_df,
121 | tile_dir=config.args.tile_dir,
122 | transforms=get_val_transforms(normalize=True))
123 | val_loader = DataLoader(val_dataset,
124 | batch_size=config.args.batch_size,
125 | num_workers=8)
126 |
127 | model = TissueTileNet(model=get_model(config), n_classes=n_classes)
128 | model.to(device)
129 |
130 | optimizer = Adam(model.parameters(),
131 | lr=config.args.learning_rate,
132 | weight_decay=config.args.weight_decay)
133 |
134 | criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(compute_class_weight(
135 | class_weight='balanced',
136 | classes=np.sort(train_df.tile_class.unique()),
137 | y=train_df.tile_class)).to(device).float())
138 |
139 | for epoch in range(config.args.num_epochs):
140 | train_loss, train_acc, train_f1, train_confusion = train_epoch(model,
141 | train_loader,
142 | optimizer,
143 | device,
144 | criterion)
145 | if do_validation:
146 | val_loss, val_acc, val_f1, val_confusion = validate_epoch(model,
147 | val_loader,
148 | device,
149 | criterion)
150 | else:
151 | val_loss, val_acc, val_f1, val_confusion = -1, -1, -1, -1
152 |
153 | log_results_string(epoch, starting_epoch, train_loss, val_loss, experiment_name, fold)
154 | results_str = 'training acc {:7.6f} | validation acc {:7.6f}'.format(train_acc, val_acc)
155 | print(results_str)
156 | results_str = 'training f1 {:7.6f} | validation f1 {:7.6f}'.format(train_f1, val_f1)
157 | print(results_str)
158 | print('training:\n' + str(train_confusion))
159 | print('validation:\n' + str(val_confusion))
160 | print('---')
161 | torch.save(model.state_dict(), get_checkpoint_path(experiment_name,
162 | epoch, starting_epoch,
163 | fold=fold))
164 |
165 |
166 | def prep_df(csv_path, tile_dir, map_classes=True):
167 | # load dataframe
168 | df_ = pd.read_csv(csv_path, low_memory=False)
169 | df_ = df_[df_.n_foreground_tiles > 0]
170 | df_.tile_address = df_.tile_address.map(eval)
171 | df_ = df_.explode('tile_address').reset_index()
172 | df_['tile_class'] = df_.tile_address.apply(lambda x: x[1])
173 | df_['tile_address'] = df_.tile_address.apply(lambda x: x[0])
174 |
175 | slideviewer_class_map = {1: 'Stroma',
176 | 2: 'Stroma',
177 | 3: 'Tumor',
178 | 4: 'Tumor',
179 | 5: 'Fat',
180 | 6: 'Vessel',
181 | 7: 'Vessel',
182 | 10: 'Necrosis',
183 | # 11: 'Glass',
184 | 14: 'Pen'}
185 | map_key = {'Stroma': 0,
186 | 'Tumor': 1,
187 | 'Fat': 2,
188 | 'Necrosis': 3}
189 | # 'Vessel': 3,
190 | # 'Necrosis': 4}
191 | # 'Glass': 5,
192 | # 'Pen': 5}
193 |
194 | n_classes = len(map_key)
195 | map_reverse_key = dict([(v, k) for k, v in map_key.items()])
196 |
197 | #print(df_.tile_class.value_counts())
198 | if map_classes:
199 | df_['tile_class'] = df_['tile_class'].map(slideviewer_class_map).map(map_key)
200 | df_ = df_[df_['tile_class'].isin(map_key.values())]
201 | #print(df_.tile_class.value_counts())
202 | #print(df_)
203 | df_['tile_file_name'] = tile_dir + '/' + \
204 | df_['image_path'].apply(lambda x: x.split('/')[-1].replace('.svs', '')) + '/' + \
205 | df_['tile_address'].apply(lambda x: str(x).replace('(', '').replace(')', '').replace(', ', '_') + '.png')
206 | return df_, n_classes, map_key, map_reverse_key
207 |
208 |
209 | if __name__ == '__main__':
210 | df_, n_classes, map_key, map_reverse_key = prep_df(config.args.preprocessed_cohort_csv_path,
211 | tile_dir=config.args.tile_dir)
212 | seed = 1123011750
213 |
214 | with open('checkpoints/{}_config.pickle'.format(config.args.experiment_name), 'wb') as f:
215 | pickle.dump(config.args, f)
216 |
217 | if config.args.crossval > 0:
218 | for fold, df in enumerate(general_utils.k_fold_ptwise_crossval(df_, config.args.crossval, seed)):
219 | print('\nFOLD {}'.format(fold))
220 | serialize(config.args.gpu[0], df, config.args.experiment_name, fold)
221 | elif config.args.crossval == 0:
222 | print('warning: no validation set')
223 | df_.loc[:, 'split'] = 'train'
224 |         serialize(config.args.gpu[0], df_, config.args.experiment_name, -2)
225 |
--------------------------------------------------------------------------------
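The class weights handed to CrossEntropyLoss in serialize follow sklearn's 'balanced' rule, weight_c = n_samples / (n_classes * count_c), so rarer tissue classes contribute more to the loss; a quick numeric check:

    import numpy as np
    from sklearn.utils import compute_class_weight

    y = np.array([0] * 60 + [1] * 30 + [2] * 10)  # imbalanced toy labels for 3 classes
    weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1, 2]), y=y)
    print(weights)  # [0.5556 1.1111 3.3333] == 100 / (3 * np.array([60, 30, 10]))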
/tissue-type-training/visualizations/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kmboehm/onco-fusion/a219e4817c6f397f92a08c3e45b64f6a3336a793/tissue-type-training/visualizations/.gitkeep
--------------------------------------------------------------------------------