├── tests ├── utils │ ├── __init__.py │ ├── dir_union_fixtures │ │ ├── original │ │ │ └── test.txt │ │ └── patched │ │ │ ├── sub │ │ │ └── third.txt │ │ │ └── another.txt │ ├── test_dir_union.py │ └── dir_union.py ├── manuscripts │ ├── phenoplier_full │ │ ├── ci │ │ │ └── .gitkeep │ │ └── content │ │ │ ├── 10.references.md │ │ │ ├── 04.00.results.md │ │ │ ├── 01.abstract.md │ │ │ ├── 15.acknowledgements.md │ │ │ ├── 00.front-matter.md │ │ │ ├── 04.05.01.crispr.md │ │ │ ├── metadata.yaml │ │ │ └── 02.introduction.md │ ├── ccc │ │ ├── 04.00.results.md │ │ ├── 15.acknowledgements.md │ │ ├── 10.references.md │ │ ├── 08.05.methods.data.md │ │ ├── 08.20.methods.mic.md │ │ ├── 01.abstract.md │ │ ├── metadata.yaml │ │ ├── 00.front-matter.md │ │ ├── 08.15.methods.giant.md │ │ ├── 08.01.methods.ccc.md │ │ ├── 04.05.results_intro.md │ │ ├── 02.introduction.md │ │ ├── 04.10.results_comp.md │ │ ├── 04.12.results_giant.md │ │ └── 06.discussion.md │ ├── phenoplier_full_only_first_para │ │ ├── ci │ │ │ └── .gitkeep │ │ └── content │ │ │ ├── 10.references.md │ │ │ ├── 04.00.results.md │ │ │ ├── 07.00.methods.md │ │ │ ├── 04.05.01.crispr.md │ │ │ ├── 50.00.supplementary_material.md │ │ │ ├── 15.acknowledgements.md │ │ │ ├── 05.discussion.md │ │ │ ├── 04.05.00.results_framework.md │ │ │ ├── 02.introduction.md │ │ │ ├── 01.abstract.md │ │ │ ├── 04.15.drug_disease_prediction.md │ │ │ ├── 04.20.00.traits_clustering.md │ │ │ ├── 00.front-matter.md │ │ │ └── metadata.yaml │ ├── gbk_encoded │ │ ├── 01.abstract.md │ │ └── metadata.yaml │ ├── mutator-epistasis │ │ ├── 90.back-matter.md │ │ ├── 06.acknowledgments.md │ │ ├── manual-references.json │ │ ├── metadata.yaml │ │ ├── 01.abstract.md │ │ ├── 00.front-matter.md │ │ └── 02.introduction.md │ ├── custom │ │ ├── metadata.yaml │ │ ├── 00.results_image_with_no_caption.md │ │ └── 00.results_table_below_nonended_paragraph.md │ ├── ccc_non_standard_filenames │ │ ├── 01.ab.md │ │ ├── metadata.yaml │ │ ├── 00.front-matter.md │ │ ├── 08.01.meths.md 
│ │ ├── 04.05.res.md │ │ └── 02.beginning.md │ └── phenoplier │ │ ├── 50.00.supplementary_material.md │ │ ├── 50.01.supplementary_material.md │ │ └── metadata.yaml ├── config_loader_fixtures │ ├── single_generic_prompt │ │ ├── ai-revision-prompts.yaml │ │ └── ai-revision-config.yaml │ ├── phenoplier_full │ │ ├── ai-revision-prompts.yaml │ │ └── ai-revision-config.yaml │ ├── both_prompts_config │ │ ├── ai-revision-config.yaml │ │ └── ai-revision-prompts.yaml │ ├── conflicting_promptsfiles_matchings │ │ ├── ai-revision-config.yaml │ │ └── ai-revision-prompts.yaml │ ├── prompt_propogation │ │ ├── ai-revision-prompts.yaml │ │ └── ai-revision-config.yaml │ ├── prompt_gpt3_e2e │ │ ├── ai-revision-config.yaml │ │ └── ai-revision-prompts.yaml │ └── only_revision_prompts │ │ └── ai-revision-prompts.yaml ├── provider_fixtures │ ├── provider_model_engines.json │ └── refresh_model_engines.py ├── conftest.py └── test_model_providers.py ├── libs └── manubot_ai_editor │ ├── __init__.py │ ├── exceptions.py │ ├── utils.py │ └── env_vars.py ├── environment.yml ├── .github ├── release-drafter.yml └── workflows │ ├── draft-release.yml │ ├── publish-pypi.yml │ └── run-tests.yml ├── pyproject.toml ├── LICENSE.md ├── .pre-commit-config.yaml ├── CITATION.cff ├── .gitignore ├── docs ├── env-vars.md └── custom-prompts.md └── CODE_OF_CONDUCT.md /tests/utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full/ci/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc/04.00.results.md: -------------------------------------------------------------------------------- 1 | ## Results 2 | -------------------------------------------------------------------------------- 
/tests/manuscripts/phenoplier_full_only_first_para/ci/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/utils/dir_union_fixtures/original/test.txt: -------------------------------------------------------------------------------- 1 | hello, world! -------------------------------------------------------------------------------- /tests/utils/dir_union_fixtures/patched/sub/third.txt: -------------------------------------------------------------------------------- 1 | a third file -------------------------------------------------------------------------------- /tests/manuscripts/ccc/15.acknowledgements.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /tests/utils/dir_union_fixtures/patched/another.txt: -------------------------------------------------------------------------------- 1 | patched in via unify mock -------------------------------------------------------------------------------- /tests/config_loader_fixtures/single_generic_prompt/ai-revision-prompts.yaml: -------------------------------------------------------------------------------- 1 | prompts_files: 2 | \.md$: | 3 | Proofread the following paragraph 4 | -------------------------------------------------------------------------------- /tests/manuscripts/gbk_encoded/01.abstract.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/manubot/manubot-ai-editor/HEAD/tests/manuscripts/gbk_encoded/01.abstract.md -------------------------------------------------------------------------------- /tests/manuscripts/ccc/10.references.md: -------------------------------------------------------------------------------- 1 | ## References {.page_break_before} 2 | 3 | 4 |
5 | -------------------------------------------------------------------------------- /tests/manuscripts/mutator-epistasis/90.back-matter.md: -------------------------------------------------------------------------------- 1 | ## References {.page_break_before} 2 | 3 | 4 |
5 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full/content/10.references.md: -------------------------------------------------------------------------------- 1 | ## References {.page_break_before} 2 | 3 | 4 |
5 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/single_generic_prompt/ai-revision-config.yaml: -------------------------------------------------------------------------------- 1 | files: 2 | ignore: 3 | - front\-matter 4 | - back\-matter 5 | - response\-to\-reviewers 6 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/10.references.md: -------------------------------------------------------------------------------- 1 | ## References {.page_break_before} 2 | 3 | 4 |
5 | -------------------------------------------------------------------------------- /libs/manubot_ai_editor/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Init file for manubot_ai_editor 3 | """ 4 | 5 | # note: version data is maintained by poetry-dynamic-versioning (do not edit) 6 | __version__ = "0.0.0" 7 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: manubot-ai-editor 2 | channels: 3 | - conda-forge 4 | - defaults 5 | dependencies: 6 | - openai==0.28 7 | - pip 8 | - pytest=7.* 9 | - python=3.10.* 10 | - pyyaml=6.* 11 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full/content/04.00.results.md: -------------------------------------------------------------------------------- 1 | ## Results 2 | 3 | 11 | -------------------------------------------------------------------------------- /tests/manuscripts/gbk_encoded/metadata.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | title: "A Chinese hello, world; 你好,世界" 3 | keywords: 4 | - encoding 5 | lang: zh-CN 6 | authors: 7 | - name: 姓名 8 | initials: 姓名 9 | orcid: 0000-0000-0000-0000 10 | twitter: 姓名 11 | email: 姓名@姓名.edu 12 | -------------------------------------------------------------------------------- /tests/manuscripts/mutator-epistasis/06.acknowledgments.md: -------------------------------------------------------------------------------- 1 | ## Acknowledgments 2 | 3 | We thank Robert W. Williams (University of Tennessee Health Sciences Center) and Don F. Conrad (Oregon Health & Science University) for very helpful comments and feedback on a draft of this manuscript. 
-------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/04.00.results.md: -------------------------------------------------------------------------------- 1 | ## Results 2 | 3 | 11 | -------------------------------------------------------------------------------- /libs/manubot_ai_editor/exceptions.py: -------------------------------------------------------------------------------- 1 | """ 2 | Exception classes that are shared across modules in the project. 3 | """ 4 | 5 | 6 | class APIKeyInvalidError(Exception): 7 | """ 8 | Raised when a provider request is attempted with an invalid API key. 9 | """ 10 | 11 | pass 12 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/phenoplier_full/ai-revision-prompts.yaml: -------------------------------------------------------------------------------- 1 | prompts: 2 | abstract: | 3 | Test match abstract. 4 | 5 | introduction_discussion: | 6 | Test match introduction or discussion. 7 | 8 | results: | 9 | Test match results. 10 | 11 | methods: | 12 | Test match methods. 
13 | 14 | my_default_prompt: | 15 | default prompt text 16 | -------------------------------------------------------------------------------- /tests/manuscripts/mutator-epistasis/manual-references.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "id": "url:https://github.com/manubot/rootstock", 4 | "type": "webpage", 5 | "URL": "https://github.com/manubot/rootstock", 6 | "title": "manubot/rootstock GitHub repository", 7 | "container-title": "GitHub", 8 | "issued": { 9 | "date-parts": [ 10 | [ 11 | 2019 12 | ] 13 | ] 14 | }, 15 | "author": [ 16 | { 17 | "given": "Daniel", 18 | "family": "Himmelstein" 19 | } 20 | ] 21 | } 22 | ] 23 | -------------------------------------------------------------------------------- /.github/release-drafter.yml: -------------------------------------------------------------------------------- 1 | --- 2 | # template configuration for release-drafter 3 | # see: https://github.com/release-drafter/release-drafter 4 | name-template: 'v$RESOLVED_VERSION' 5 | tag-template: 'v$RESOLVED_VERSION' 6 | version-resolver: 7 | major: 8 | labels: 9 | - 'release-major' 10 | minor: 11 | labels: 12 | - 'release-minor' 13 | patch: 14 | labels: 15 | - 'release-patch' 16 | default: patch 17 | change-template: '- $TITLE (@$AUTHOR via #$NUMBER)' 18 | template: | 19 | ## Changes 20 | 21 | $CHANGES -------------------------------------------------------------------------------- /tests/manuscripts/ccc/08.05.methods.data.md: -------------------------------------------------------------------------------- 1 | ### Gene expression data and preprocessing {#sec:data_gtex} 2 | 3 | We downloaded GTEx v8 data for all tissues, normalized using TPM (transcripts per million), and focused our primary analysis on whole blood, which has a good sample size (755). 4 | We selected the top 5,000 genes from whole blood with the largest variance after standardizing with $log(x + 1)$ to avoid a bias towards highly-expressed genes. 
5 | We then computed Pearson, Spearman, MIC and CCC on these 5,000 genes across all 755 samples on the TPM-normalized data, generating a pairwise similarity matrix of size 5,000 x 5,000. 6 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/both_prompts_config/ai-revision-config.yaml: -------------------------------------------------------------------------------- 1 | files: 2 | matchings: 3 | - files: 4 | - abstract 5 | prompt: abstract 6 | - files: 7 | - introduction 8 | prompt: introduction_discussion 9 | - files: 10 | - 04\..+\.md 11 | prompt: results 12 | - files: 13 | - discussion 14 | prompt: introduction_discussion 15 | - files: 16 | - methods 17 | prompt: methods 18 | 19 | default_prompt: default 20 | 21 | ignore: 22 | - front-matter 23 | - acknowledgements 24 | - supplementary_material 25 | - references 26 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/phenoplier_full/ai-revision-config.yaml: -------------------------------------------------------------------------------- 1 | files: 2 | matchings: 3 | - files: 4 | - abstract 5 | prompt: abstract 6 | - files: 7 | - introduction 8 | prompt: introduction_discussion 9 | - files: 10 | - 04\..+\.md 11 | prompt: results 12 | - files: 13 | - discussion 14 | prompt: introduction_discussion 15 | - files: 16 | - methods 17 | prompt: methods 18 | 19 | default_prompt: my_default_prompt 20 | 21 | ignore: 22 | - front\-matter 23 | - acknowledgements 24 | - supplementary_material 25 | - references -------------------------------------------------------------------------------- /tests/config_loader_fixtures/conflicting_promptsfiles_matchings/ai-revision-config.yaml: -------------------------------------------------------------------------------- 1 | files: 2 | matchings: 3 | - files: 4 | - abstract 5 | prompt: abstract 6 | - files: 7 | - introduction 8 | prompt: introduction_discussion 9 | - 
files: 10 | - 04\..+\.md 11 | prompt: results 12 | - files: 13 | - discussion 14 | prompt: introduction_discussion 15 | - files: 16 | - methods 17 | prompt: methods 18 | 19 | default_prompt: default 20 | 21 | ignore: 22 | - front-matter 23 | - acknowledgements 24 | - supplementary_material 25 | - references 26 | -------------------------------------------------------------------------------- /.github/workflows/draft-release.yml: -------------------------------------------------------------------------------- 1 | --- 2 | # workflow for drafting releases on GitHub 3 | # see: https://github.com/release-drafter/release-drafter 4 | name: release drafter 5 | 6 | on: 7 | push: 8 | branches: 9 | - main 10 | 11 | jobs: 12 | draft_release: 13 | permissions: 14 | # write permission is required to create a github release 15 | contents: write 16 | # write permission is required for autolabeler 17 | # otherwise, read permission is required at least 18 | pull-requests: write 19 | runs-on: ubuntu-latest 20 | steps: 21 | - uses: release-drafter/release-drafter@v6 22 | env: 23 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} -------------------------------------------------------------------------------- /tests/manuscripts/ccc/08.20.methods.mic.md: -------------------------------------------------------------------------------- 1 | ### Maximal Information Coefficient (MIC) {#sec:methods:mic} 2 | 3 | We used the Python package `minepy` [@doi:10.1093/bioinformatics/bts707; @url:https://github.com/minepy/minepy] (version 1.2.5) to estimate the MIC coefficient. 4 | In GTEx v8 (whole blood), we used MICe (an improved implementation of the original MIC introduced in [@Reshef2016]) with the default parameters `alpha=0.6`, `c=15` and `estimator='mic_e'`. 5 | We used the `pairwise_distances` function from `scikit-learn` [@Sklearn2011] to parallelize the computation of MIC on GTEx. 
6 | For our computational complexity analyses (see [Supplementary Material](#sec:time_test)), we ran the original MIC (using parameter `estimator='mic_approx'`) and MICe (`estimator='mic_e'`). 7 | -------------------------------------------------------------------------------- /tests/manuscripts/custom/metadata.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Manuscript Title" 3 | date: null # Defaults to date generated, but can specify like '2022-10-31'. 4 | keywords: 5 | - markdown 6 | - publishing 7 | - manubot 8 | lang: en-US 9 | authors: 10 | - github: johndoe 11 | name: John Doe 12 | initials: JD 13 | orcid: XXXX-XXXX-XXXX-XXXX 14 | twitter: johndoe 15 | email: john.doe@something.com 16 | affiliations: 17 | - Department of Something, University of Whatever 18 | funders: 19 | - Grant XXXXXXXX 20 | - github: janeroe 21 | name: Jane Roe 22 | initials: JR 23 | orcid: XXXX-XXXX-XXXX-XXXX 24 | email: jane.roe@whatever.edu 25 | affiliations: 26 | - Department of Something, University of Whatever 27 | - Department of Whatever, University of Something 28 | corresponding: true 29 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/prompt_propogation/ai-revision-prompts.yaml: -------------------------------------------------------------------------------- 1 | prompts: 2 | front_matter: This is the front-matter prompt 3 | abstract: This is the abstract prompt 4 | introduction: "This is the introduction prompt for the paper titled '{title}'." 
5 | results: This is the results prompt 6 | results_framework: This is the results_framework prompt 7 | crispr: This is the crispr prompt 8 | drug_disease_prediction: This is the drug_disease_prediction prompt 9 | traits_clustering: This is the traits_clustering prompt 10 | discussion: This is the discussion prompt 11 | methods: This is the methods prompt 12 | references: This is the references prompt 13 | acknowledgements: This is the acknowledgements prompt 14 | supplementary_material: This is the supplementary_material prompt 15 | 16 | default: | 17 | This is the default prompt 18 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/07.00.methods.md: -------------------------------------------------------------------------------- 1 | ## Methods {#sec:methods} 2 | 3 | PhenoPLIER is a framework that combines different computational approaches to integrate gene-trait associations and drug-induced transcriptional responses with groups of functionally-related genes (referred to as gene modules or latent variables/LVs). 4 | Gene-trait associations are computed using the PrediXcan family of methods, whereas latent variables are inferred by the MultiPLIER models applied on large gene expression compendia. 5 | PhenoPLIER provides 6 | 1) a regression model to compute an LV-trait association, 7 | 2) a consensus clustering approach applied to the latent space to learn shared and distinct transcriptomic properties between traits, and 8 | 3) an interpretable, LV-based drug repurposing framework. 9 | We provide the details of these methods below. 
10 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/04.05.01.crispr.md: -------------------------------------------------------------------------------- 1 | ### LVs link genes that alter lipid accumulation with relevant traits and tissues 2 | 3 | Our first experiment attempted to answer whether genes in a disease-relevant LV could represent potential therapeutic targets. 4 | For this, the first step was to obtain a set of genes strongly associated with a phenotype of interest. 5 | Therefore, we performed a fluorescence-based CRISPR-Cas9 in the HepG2 cell line and identified 462 genes associated with lipid regulation ([Methods](#sec:methods:crispr)). 6 | From these, we selected two high-confidence gene sets that either caused a decrease or increase of lipids: 7 | a lipids-decreasing gene-set with eight genes: *BLCAP*, *FBXW7*, *INSIG2*, *PCYT2*, *PTEN*, *SOX9*, *TCF7L2*, *UBE2J2*; 8 | and a lipids-increasing gene-set with six genes: *ACACA*, *DGAT2*, *HILPDA*, *MBTPS1*, *SCAP*, *SRPR* (Supplementary Data 2). 9 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/50.00.supplementary_material.md: -------------------------------------------------------------------------------- 1 | \clearpage 2 | 3 | ## Supplementary information {.page_break_before} 4 | 5 | ### Supplementary Note 1: mean type I error rates and calibration of LV-based regression model {#sm:reg:null_sim} 6 | 7 | We assessed our GLS model type I error rates (proportion of $p$-values below 0.05) and calibration using a null model of random traits and genotype data from 1000 Genomes Phase III. 8 | We selected 312 individuals with European ancestry, and then analyzed 1,000 traits drawn from a standard normal distribution $\mathcal{N}(0,1)$. 
9 | We ran all the standard procedures for the TWAS approaches (S-PrediXcan and S-MultiXcan), including: 10 | 1) a standard GWAS using linear regression under an additive genetic model, 11 | 2) different GWAS processing steps, including harmonization and imputation procedures as defined in [@doi:10.1002/gepi.22346], 12 | 3) S-PrediXcan and S-MultiXcan analyses. 13 | Below we provide details for each of these steps. 14 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/15.acknowledgements.md: -------------------------------------------------------------------------------- 1 | ## Acknowledgements 2 | 3 | This study was funded by: 4 | the Gordon and Betty Moore Foundation (GBMF 4552 to C.S. Greene; GBMF 4560 to B.D. Sullivan), 5 | the National Human Genome Research Institute (R01 HG010067 to C.S. Greene, S.F.A. Grant and B.D. Sullivan; K99 HG011898 and R00 HG011898 to M. Pividori; U01 HG011181 to W. Wei), 6 | the National Cancer Institute (R01 CA237170 to C.S. Greene), 7 | the Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765 to C.S. Greene), 8 | the National Institute of Aging (R01AG069900 to W. Wei), 9 | the National Institute of General Medical Sciences (R01 GM139891 to W. Wei); 10 | the National Heart, Lung, and Blood Institute (R01 HL163854 to Q. Feng); 11 | the National Institute of Diabetes and Digestive and Kidney Diseases (DK126194 to B.F. Voight); 12 | the Daniel B. Burke Endowed Chair for Diabetes Research to S.F.A. Grant; 13 | the Robert L. McNeil Jr. Endowed Fellowship in Translational Medicine and Therapeutics to C. Skarke. 
14 | -------------------------------------------------------------------------------- /tests/manuscripts/custom/00.results_image_with_no_caption.md: -------------------------------------------------------------------------------- 1 | ## Results 2 | 3 | This is the revision of the first paragraph of the introduction of CCC. 4 | This is the revision of the first paragraph of the introduction of CCC. 5 | This is the revision of the first paragraph of the introduction of CCC. 6 | This is the revision of the first paragraph of the introduction of CCC. 7 | This is the revision of the first paragraph of the introduction of CCC. 8 | This is the revision of the first paragraph of the introduction of CCC. 9 | This is the revision of the first paragraph of the introduction of CCC. 10 | This is the revision of the first paragraph of the introduction of CCC: 11 | 12 | ![ 13 | ](images/diffs/introduction/ccc-paragraph-01.svg "Diffs - CCC introduction paragraph 01"){width="100%"} 14 | 15 | The tool, again, significantly revised the text, producing a much better and more concise introductory paragraph. 16 | For example, the revised first sentence (on the right) incorporates the ideas of "large datasets", and the "opportunities/possibilities" for "scientific exploration" clearly and briefly. 
17 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/prompt_gpt3_e2e/ai-revision-config.yaml: -------------------------------------------------------------------------------- 1 | files: 2 | matchings: 3 | - files: 4 | - front-matter 5 | prompt: front_matter 6 | - files: 7 | - abstract 8 | prompt: abstract 9 | - files: 10 | - introduction 11 | prompt: introduction 12 | - files: 13 | - results_framework 14 | prompt: results_framework 15 | - files: 16 | - results 17 | prompt: results 18 | - files: 19 | - crispr 20 | prompt: crispr 21 | - files: 22 | - drug_disease_prediction 23 | prompt: drug_disease_prediction 24 | - files: 25 | - traits_clustering 26 | prompt: traits_clustering 27 | - files: 28 | - discussion 29 | prompt: discussion 30 | - files: 31 | - methods 32 | prompt: methods 33 | - files: 34 | - references 35 | prompt: references 36 | - files: 37 | - acknowledgements 38 | prompt: acknowledgements 39 | - files: 40 | - supplementary_material 41 | prompt: supplementary_material 42 | 43 | default_prompt: default 44 | 45 | ignore: 46 | - results 47 | - references 48 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/prompt_propogation/ai-revision-config.yaml: -------------------------------------------------------------------------------- 1 | files: 2 | matchings: 3 | - files: 4 | - front-matter 5 | prompt: front_matter 6 | - files: 7 | - abstract 8 | prompt: abstract 9 | - files: 10 | - introduction 11 | prompt: introduction 12 | - files: 13 | - results_framework 14 | prompt: results_framework 15 | - files: 16 | - results 17 | prompt: results 18 | - files: 19 | - crispr 20 | prompt: crispr 21 | - files: 22 | - drug_disease_prediction 23 | prompt: drug_disease_prediction 24 | - files: 25 | - traits_clustering 26 | prompt: traits_clustering 27 | - files: 28 | - discussion 29 | prompt: discussion 30 | - files: 31 | - methods 32 | prompt: 
methods 33 | - files: 34 | - references 35 | prompt: references 36 | - files: 37 | - acknowledgements 38 | prompt: acknowledgements 39 | - files: 40 | - supplementary_material 41 | prompt: supplementary_material 42 | 43 | default_prompt: default 44 | 45 | ignore: 46 | - results 47 | - references 48 | -------------------------------------------------------------------------------- /.github/workflows/publish-pypi.yml: -------------------------------------------------------------------------------- 1 | --- 2 | # used for publishing packages to pypi on release 3 | name: publish pypi release 4 | 5 | on: 6 | release: 7 | types: 8 | - published 9 | 10 | jobs: 11 | publish_pypi: 12 | runs-on: ubuntu-latest 13 | environment: release 14 | permissions: 15 | # IMPORTANT: this permission is mandatory for trusted publishing 16 | id-token: write 17 | steps: 18 | - name: Checkout 19 | uses: actions/checkout@v4 20 | with: 21 | fetch-depth: 0 22 | - name: Fetch tags 23 | run: git fetch --all --tags 24 | - name: Python setup 25 | uses: actions/setup-python@v5 26 | with: 27 | python-version: "3.11" 28 | - name: Setup for poetry 29 | run: | 30 | python -m pip install poetry 31 | poetry self add "poetry-dynamic-versioning[plugin]" 32 | - name: Install environment 33 | run: poetry install --no-interaction --no-ansi 34 | - name: poetry build distribution content 35 | run: poetry build 36 | - name: Publish package distributions to PyPI 37 | uses: pypa/gh-action-pypi-publish@release/v1 -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/05.discussion.md: -------------------------------------------------------------------------------- 1 | ## Discussion 2 | 3 | We have introduced a novel computational strategy that integrates statistical associations from TWAS with groups of genes (gene modules) that have similar expression patterns across the same cell types. 
4 | Our key innovation is that we project gene-trait associations through a latent representation derived not strictly from measures of normal tissue but also from cell types under a variety of stimuli and at various developmental stages. 5 | This improves interpretation by going beyond statistical associations to infer cell type-specific features of complex phenotypes. 6 | Our approach can identify disease-relevant cell types from summary statistics, and several disease-associated gene modules were replicated in eMERGE. 7 | Using a CRISPR screen to analyze lipid regulation, we found that our gene module-based approach can prioritize causal genes even when single gene associations are not detected. 8 | We interpret these findings with an omnigenic perspective of "core" and "peripheral" genes, suggesting that the approach can identify genes that directly affect the trait with no mediated regulation of other genes and thus prioritize alternative and potentially more attractive therapeutic targets. 9 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/04.05.00.results_framework.md: -------------------------------------------------------------------------------- 1 | ### PhenoPLIER: an integration framework based on gene co-expression patterns 2 | 3 | PhenoPLIER is a flexible computational framework that combines gene-trait and gene-drug associations with gene modules expressed in specific contexts (Figure {@fig:entire_process}a). 4 | The approach uses a latent representation (with latent variables or LVs representing gene modules) derived from a large gene expression compendium (Figure {@fig:entire_process}b, top) to integrate TWAS with drug-induced transcriptional responses (Figure {@fig:entire_process}b, middle) for a joint analysis. 
5 | The approach consists in three main components (Figure {@fig:entire_process}b, bottom, see [Methods](#sec:methods)): 6 | 1) an LV-based regression model to compute an association between an LV and a trait, 7 | 2) a clustering framework to learn groups of traits with shared transcriptomic properties, 8 | and 3) an LV-based drug repurposing approach that links diseases to potential treatments. 9 | We performed extensive simulations for our regression model ([Supplementary Note 1](#sm:reg:null_sim)) and clustering framework ([Supplementary Note 2](#sm:clustering:null_sim)) to ensure proper calibration and expected results under a model of no association. 10 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/02.introduction.md: -------------------------------------------------------------------------------- 1 | ## Introduction 2 | 3 | Genes work together in context-specific networks to carry out different functions [@pmid:19104045; @doi:10.1038/ng.3259]. 4 | Variations in these genes can change their functional role and, at a higher level, affect disease-relevant biological processes [@doi:10.1038/s41467-018-06022-6]. 5 | In this context, determining how genes influence complex traits requires mechanistically understanding expression regulation across different cell types [@doi:10.1126/science.aaz1776; @doi:10.1038/s41586-020-2559-3; @doi:10.1038/s41576-019-0200-9], which in turn should lead to improved treatments [@doi:10.1038/ng.3314; @doi:10.1371/journal.pgen.1008489]. 6 | Previous studies have described different regulatory DNA elements [@doi:10.1038/nature11247; @doi:10.1038/nature14248; @doi:10.1038/nature12787; @doi:10.1038/s41586-020-03145-z; @doi:10.1038/s41586-020-2559-3] including genetic effects on gene expression across different tissues [@doi:10.1126/science.aaz1776]. 
7 | Integrating functional genomics data and GWAS data [@doi:10.1038/s41588-018-0081-4; @doi:10.1016/j.ajhg.2018.04.002; @doi:10.1038/s41588-018-0081-4; @doi:10.1038/ncomms6890] has improved the identification of these transcriptional mechanisms that, when dysregulated, commonly result in tissue- and cell lineage-specific pathology [@pmid:20624743; @pmid:14707169; @doi:10.1073/pnas.0810772105]. 8 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc/01.abstract.md: -------------------------------------------------------------------------------- 1 | ## Abstract {.page_break_before} 2 | 3 | Correlation coefficients are widely used to identify patterns in data that may be of particular interest. 4 | In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes. 5 | Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models. 6 | CCC reveals biologically meaningful linear and nonlinear patterns missed by standard, linear-only correlation coefficients. 7 | CCC captures general patterns in data by comparing clustering solutions while being much faster than state-of-the-art coefficients such as the Maximal Information Coefficient. 8 | When applied to human gene expression data, CCC identifies robust linear relationships while detecting nonlinear patterns associated, for example, with sex differences that are not captured by linear-only coefficients. 9 | Gene pairs highly ranked by CCC were enriched for interactions in integrated networks built from protein-protein interaction, transcription factor regulation, and chemical and genetic perturbations, suggesting that CCC could detect functional relationships that linear-only methods missed. 
10 | CCC is a highly-efficient, next-generation not-only-linear correlation coefficient that can readily be applied to genome-scale data and other domains across different data types. 11 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc_non_standard_filenames/01.ab.md: -------------------------------------------------------------------------------- 1 | ## Abstract {.page_break_before} 2 | 3 | Correlation coefficients are widely used to identify patterns in data that may be of particular interest. 4 | In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes. 5 | Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models. 6 | CCC reveals biologically meaningful linear and nonlinear patterns missed by standard, linear-only correlation coefficients. 7 | CCC captures general patterns in data by comparing clustering solutions while being much faster than state-of-the-art coefficients such as the Maximal Information Coefficient. 8 | When applied to human gene expression data, CCC identifies robust linear relationships while detecting nonlinear patterns associated, for example, with sex differences that are not captured by linear-only coefficients. 9 | Gene pairs highly ranked by CCC were enriched for interactions in integrated networks built from protein-protein interaction, transcription factor regulation, and chemical and genetic perturbations, suggesting that CCC could detect functional relationships that linear-only methods missed. 10 | CCC is a highly-efficient, next-generation not-only-linear correlation coefficient that can readily be applied to genome-scale data and other domains across different data types. 
11 | -------------------------------------------------------------------------------- /tests/manuscripts/custom/00.results_table_below_nonended_paragraph.md: -------------------------------------------------------------------------------- 1 | ## Results 2 | 3 | This is the revision of the first paragraph of the introduction of CCC. 4 | This is the revision of the first paragraph of the introduction of CCC. 5 | This is the revision of the first paragraph of the introduction of CCC. 6 | This is the revision of the first paragraph of the introduction of CCC. 7 | This is the revision of the first paragraph of the introduction of CCC. 8 | This is the revision of the first paragraph of the introduction of CCC. 9 | This is the revision of the first paragraph of the introduction of CCC. 10 | This is the revision of the first paragraph of the introduction of CCC: 11 | 12 | | Pathway | AUC | FDR | 13 | |:------------------------------------|:------|:---------| 14 | | IRIS Neutrophil-Resting | 0.91 | 4.51e-35 | 15 | | SVM Neutrophils | 0.98 | 1.43e-09 | 16 | | PID IL8CXCR2 PATHWAY | 0.81 | 7.04e-03 | 17 | | SIG PIP3 SIGNALING IN B LYMPHOCYTES | 0.77 | 1.95e-02 | 18 | 19 | Table: Pathways aligned to LV603 from the MultiPLIER models. {#tbl:sup:multiplier_pathways:lv603} 20 | 21 | The tool, again, significantly revised the text, producing a much better and more concise introductory paragraph. 22 | For example, the revised first sentence (on the right) incorporates the ideas of "large datasets" and the "opportunities/possibilities" for "scientific exploration" clearly and briefly. 
23 | -------------------------------------------------------------------------------- /.github/workflows/run-tests.yml: -------------------------------------------------------------------------------- 1 | --- 2 | name: run tests 3 | 4 | on: 5 | push: 6 | branches: [main] 7 | pull_request: 8 | branches: [main] 9 | 10 | jobs: 11 | pre_commit_checks: 12 | runs-on: ubuntu-24.04 13 | steps: 14 | - uses: actions/checkout@v4 15 | - uses: actions/setup-python@v5 16 | with: 17 | python-version: "3.10" 18 | - uses: pre-commit/action@v3.0.1 19 | id: pre_commit 20 | # run pre-commit ci lite for automated fixes 21 | - uses: pre-commit-ci/lite-action@v1.1.0 22 | if: ${{ !cancelled() && steps.pre_commit.outcome == 'failure' }} 23 | tests: 24 | strategy: 25 | matrix: 26 | # matrixed execution for parallel gh-action performance increases 27 | python_version: ["3.10", "3.11", "3.12", "3.13"] 28 | os: [ubuntu-24.04, macos-14] 29 | runs-on: ${{ matrix.os }} 30 | env: 31 | OS: ${{ matrix.os }} 32 | steps: 33 | - name: Checkout 34 | uses: actions/checkout@v4 35 | - name: Python setup 36 | uses: actions/setup-python@v5 37 | with: 38 | python-version: ${{ matrix.python_version }} 39 | - name: Setup poetry 40 | run: | 41 | pip install poetry 42 | - name: Install poetry env 43 | run: | 44 | poetry install 45 | - name: Run pytest 46 | env: 47 | # set placeholder API key, required by tests 48 | PROVIDER_API_KEY: ABCD1234 49 | run: poetry run pytest 50 | -------------------------------------------------------------------------------- /tests/manuscripts/mutator-epistasis/metadata.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Epistasis between mutator alleles contributes to germline mutation rate variability in laboratory mice" 3 | date: null # Defaults to date generated, but can specify like '2022-10-31'. 4 | keywords: 5 | - markdown 6 | - publishing 7 | - manubot 8 | lang: en-US 9 | authors: 10 | - name: Thomas A. 
Sasani 11 | github: tomsasani 12 | initials: TAS 13 | orcid: 0000-0003-2317-1374 14 | twitter: tomsasani 15 | email: thomas.a.sasani@gmail.com 16 | affiliations: 17 | - Department of Human Genetics, University of Utah 18 | - name: Aaron R. Quinlan 19 | initials: ARQ 20 | orcid: 0000-0003-1756-0859 21 | twitter: aaronquinlan 22 | email: aquinlan@genetics.utah.edu 23 | affiliations: 24 | - Department of Human Genetics, University of Utah 25 | - Department of Biomedical Informatics, University of Utah 26 | funders: 27 | - NIH/NHGRI R01HG012252 28 | corresponding: true 29 | - name: Kelley Harris 30 | initials: KH 31 | orcid: 0000-0003-0302-2523 32 | twitter: Kelley__Harris 33 | email: harriske@uw.edu 34 | affiliations: 35 | - Department of Genome Sciences, University of Washington 36 | funders: 37 | - NIH/NIGMS R35GM133428 38 | - Burroughs Wellcome Career Award at the Scientific Interface 39 | - Searle Scholarship 40 | - Pew Scholarship 41 | - Sloan Fellowship 42 | - Allen Discovery Center for Cell Lineage Tracing 43 | corresponding: true 44 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full/content/01.abstract.md: -------------------------------------------------------------------------------- 1 | ## Abstract {.page_break_before} 2 | 3 | Genes act in concert with each other in specific contexts to perform their functions. 4 | Determining how these genes influence complex traits requires a mechanistic understanding of expression regulation across different conditions. 5 | It has been shown that this insight is critical for developing new therapies. 6 | Transcriptome-wide association studies have helped uncover the role of individual genes in disease-relevant mechanisms. 7 | However, modern models of the architecture of complex traits predict that gene-gene interactions play a crucial role in disease origin and progression. 
8 | Here we introduce PhenoPLIER, a computational approach that maps gene-trait associations and pharmacological perturbation data into a common latent representation for a joint analysis. 9 | This representation is based on modules of genes with similar expression patterns across the same conditions. 10 | We observe that diseases are significantly associated with gene modules expressed in relevant cell types, and our approach is accurate in predicting known drug-disease pairs and inferring mechanisms of action. 11 | Furthermore, using a CRISPR screen to analyze lipid regulation, we find that functionally important players lack associations but are prioritized in trait-associated modules by PhenoPLIER. 12 | By incorporating groups of co-expressed genes, PhenoPLIER can contextualize genetic associations and reveal potential targets missed by single-gene strategies. 13 | -------------------------------------------------------------------------------- /tests/utils/test_dir_union.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | from unittest import mock 3 | 4 | from .dir_union import mock_unify_open, set_directory 5 | 6 | # tests for mock_unify_open 7 | 8 | UNIFY_TEST_DIR = Path(__file__).parent / "dir_union_fixtures" 9 | UNIFY_ORIG_DIR = UNIFY_TEST_DIR / "original" 10 | UNIFY_PATCHED_DIR = UNIFY_TEST_DIR / "patched" 11 | 12 | 13 | @mock.patch("builtins.open", mock_unify_open(UNIFY_ORIG_DIR, UNIFY_PATCHED_DIR)) 14 | def test_unify_folder_mock(): 15 | # test that we can still open files in the original folder 16 | with open(UNIFY_ORIG_DIR / "test.txt") as fp: 17 | assert fp.read().strip() == "hello, world!" 
18 | # test that the patched folder takes precedence 19 | with open(UNIFY_ORIG_DIR / "another.txt") as fp: 20 | assert fp.read().strip() == "patched in via unify mock" 21 | 22 | 23 | @mock.patch("builtins.open", mock_unify_open(UNIFY_ORIG_DIR, UNIFY_PATCHED_DIR)) 24 | def test_unify_folder_mock_relative_paths(): 25 | with set_directory(UNIFY_ORIG_DIR): 26 | # test that we can still open files in the original folder 27 | with open("./test.txt") as fp: 28 | assert fp.read().strip() == "hello, world!" 29 | # test that the patched folder takes precedence 30 | with open("./another.txt") as fp: 31 | assert fp.read().strip() == "patched in via unify mock" 32 | # test that subfolders in the patched folder can be used 33 | with open("./sub/third.txt") as fp: 34 | assert fp.read().strip() == "a third file" 35 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/01.abstract.md: -------------------------------------------------------------------------------- 1 | ## Abstract {.page_break_before} 2 | 3 | Genes act in concert with each other in specific contexts to perform their functions. 4 | Determining how these genes influence complex traits requires a mechanistic understanding of expression regulation across different conditions. 5 | It has been shown that this insight is critical for developing new therapies. 6 | Transcriptome-wide association studies have helped uncover the role of individual genes in disease-relevant mechanisms. 7 | However, modern models of the architecture of complex traits predict that gene-gene interactions play a crucial role in disease origin and progression. 8 | Here we introduce PhenoPLIER, a computational approach that maps gene-trait associations and pharmacological perturbation data into a common latent representation for a joint analysis. 9 | This representation is based on modules of genes with similar expression patterns across the same conditions. 
10 | We observe that diseases are significantly associated with gene modules expressed in relevant cell types, and our approach is accurate in predicting known drug-disease pairs and inferring mechanisms of action. 11 | Furthermore, using a CRISPR screen to analyze lipid regulation, we find that functionally important players lack associations but are prioritized in trait-associated modules by PhenoPLIER. 12 | By incorporating groups of co-expressed genes, PhenoPLIER can contextualize genetic associations and reveal potential targets missed by single-gene strategies. 13 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | build-backend = "poetry_dynamic_versioning.backend" 3 | requires = [ "poetry-core>=1", "poetry-dynamic-versioning>=1,<2" ] 4 | 5 | [tool.poetry] 6 | name = "manubot-ai-editor" 7 | # note: version data is maintained by poetry-dynamic-versioning (do not edit) 8 | version = "0.0.0" 9 | description = "A Manubot plugin to revise a manuscript using GPT-3" 10 | authors = [ "Milton Pividori " ] 11 | maintainers = [ 12 | "Milton Pividori", 13 | "Faisal Alquaddoomi", 14 | "Vincent Rubinetti", 15 | "Dave Bunten", 16 | ] 17 | license = "BSD-3-Clause" 18 | readme = "README.md" 19 | repository = "https://github.com/manubot/manubot-ai-editor" 20 | homepage = "https://github.com/manubot/manubot-ai-editor" 21 | classifiers = [ 22 | "Programming Language :: Python :: 3", 23 | "License :: OSI Approved :: BSD License", 24 | "Operating System :: OS Independent", 25 | ] 26 | packages = [ { include = "manubot_ai_editor", from = "libs" } ] 27 | 28 | [tool.poetry.dependencies] 29 | python = ">=3.10,<4.0" 30 | langchain-core = "^0.3.6" 31 | langchain-openai = "^0.2.0" 32 | langchain-anthropic = "^0.3.0" 33 | pyyaml = "*" 34 | charset-normalizer = "^3.4.1" 35 | 36 | [tool.poetry.group.dev.dependencies] 37 | pytest = ">=8.3.3" 
38 | pytest-antilru = "^2.0.0" 39 | 40 | [tool.poetry.requires-plugins] 41 | poetry-dynamic-versioning = { version = ">=1.0.0,<2.0.0", extras = [ "plugin" ] } 42 | 43 | [tool.poetry-dynamic-versioning] 44 | enable = true 45 | style = "pep440" 46 | vcs = "git" 47 | substitution.files = [ "libs/manubot_ai_editor/__init__.py" ] 48 | 49 | [tool.setuptools_scm] 50 | root = "." 51 | -------------------------------------------------------------------------------- /tests/utils/dir_union.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pathlib import Path 3 | 4 | from contextlib import contextmanager 5 | 6 | 7 | @contextmanager 8 | def set_directory(new): 9 | """ 10 | Given a path, sets it as the current working directory, 11 | then sets it back once the context has been exited. 12 | 13 | Note that if we upgrade to Python 3.11, this method can be replaced 14 | with https://docs.python.org/3/library/contextlib.html#contextlib.chdir 15 | """ 16 | 17 | # store the current path so we can return to it 18 | original = Path().absolute() 19 | 20 | try: 21 | os.chdir(new) 22 | yield 23 | finally: 24 | os.chdir(original) 25 | 26 | 27 | def mock_unify_open(original, patched): 28 | """ 29 | Given paths to an 'original' and 'patched' folder, 30 | patches open() to first check the patched folder for the 31 | target file, then checks the original folder if it's not found 32 | in the patched folder. 
33 | """ 34 | builtin_open = open 35 | 36 | def unify_open(*args, **kwargs): 37 | try: 38 | # first, try to open the file from within patched 39 | 40 | # resolve all paths: the original, patched, and requested file 41 | target_full_path = Path(args[0]).absolute() 42 | rewritten_path = str(target_full_path).replace( 43 | str(original.absolute()), str(patched.absolute()) 44 | ) 45 | 46 | return builtin_open(rewritten_path, *(args[1:]), **kwargs) 47 | except FileNotFoundError: 48 | # resort to opening it normally 49 | return builtin_open(*args, **kwargs) 50 | 51 | return unify_open 52 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # BSD 3-Clause License 2 | 3 | Copyright (c) 2024, Contributors and the Pividori Lab at the University of Colorado Anschutz Medical Campus 4 | 5 | All rights reserved. 6 | 7 | Redistribution and use in source and binary forms, with or without 8 | modification, are permitted provided that the following conditions are met: 9 | 10 | 1. Redistributions of source code must retain the above copyright notice, this 11 | list of conditions and the following disclaimer. 12 | 13 | 2. Redistributions in binary form must reproduce the above copyright notice, 14 | this list of conditions and the following disclaimer in the documentation 15 | and/or other materials provided with the distribution. 16 | 17 | 3. Neither the name of the copyright holder nor the names of its 18 | contributors may be used to endorse or promote products derived from 19 | this software without specific prior written permission. 20 | 21 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 22 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 23 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 24 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 25 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 26 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 27 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 28 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 29 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 30 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 31 | -------------------------------------------------------------------------------- /libs/manubot_ai_editor/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import difflib 3 | 4 | import yaml 5 | 6 | 7 | SIMPLE_SENTENCE_END_PATTERN = re.compile(r"\.\s") 8 | SENTENCE_END_PATTERN = re.compile(r"\.\s(\S)") 9 | 10 | 11 | def get_yaml_field(yaml_file, field): 12 | """ 13 | Returns the value of a field in a YAML file. 14 | """ 15 | with open(yaml_file, "r") as f: 16 | data = yaml.safe_load(f) 17 | return data[field] 18 | 19 | 20 | def starts_with_similar(string: str, prefix: str, threshold: float = 0.8) -> bool: 21 | """ 22 | Returns True if the string starts with a prefix that is similar to the given prefix. 23 | """ 24 | return ( 25 | difflib.SequenceMatcher(None, prefix, string[: len(prefix)]).ratio() > threshold 26 | ) 27 | 28 | 29 | def get_obj_path(target: any, path: tuple, missing=None): 30 | """ 31 | Traverse a nested object using a tuple of keys, returning the last resolved 32 | value in the path. If any key is not found, return 'missing' (default None). 
33 | 34 | >>> get_obj_path({'a': {'b': {'c': 1}}}, ('a', 'b', 'c')) 35 | 1 36 | >>> get_obj_path({'a': {'b': {'c': 1}}}, ('a', 'b', 'd')) is None 37 | True 38 | >>> get_obj_path({'a': {'b': {'c': 1}}}, ('a', 'b', 'd'), missing=2) 39 | 2 40 | >>> get_obj_path({'a': [100, {'c': 1}]}, ('a', 1, 'c')) 41 | 1 42 | >>> get_obj_path({'a': [100, {'c': 1}]}, ('a', 1, 'd')) is None 43 | True 44 | >>> get_obj_path({'a': [100, {'c': 1}]}, ('a', 3)) is None 45 | True 46 | """ 47 | try: 48 | for key in path: 49 | target = target[key] 50 | except (KeyError, IndexError, TypeError): 51 | return missing 52 | 53 | return target 54 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/04.15.drug_disease_prediction.md: -------------------------------------------------------------------------------- 1 | ### LVs predict drug-disease pairs better than single genes 2 | 3 | We next determined how substituting LVs for individual genes predicted known treatment-disease relationships. 4 | For this, we used the transcriptional responses to small molecule perturbations profiled in LINCS L1000 [@doi:10.1016/j.cell.2017.10.049], which were further processed and mapped to DrugBank IDs [@doi:10.1093/nar/gkt1068; @doi:10.7554/eLife.26726; @doi:10.5281/zenodo.47223]. 5 | Based on an established drug repurposing strategy that matches reversed transcriptome patterns between genes and drug-induced perturbations [@doi:10.1126/scitranslmed.3002648; @doi:10.1126/scitranslmed.3001318], we adopted a previously described framework that uses imputed transcriptomes from TWAS to prioritize drug candidates [@doi:10.1038/nn.4618]. 6 | For this, we computed a drug-disease score by calculating the negative dot product between the $z$-scores for a disease (from TWAS) and the $z$-scores for a drug (from LINCS) across sets of genes of different sizes (see [Methods](#sec:methods:drug)). 
7 | Therefore, a large score for a drug-disease pair indicated that higher (lower) predicted expression values of disease-associated genes are down (up)-regulated by the drug, thus predicting a potential treatment. 8 | Similarly, for the LV-based approach, we estimated how pharmacological perturbations affected the gene module activity by projecting expression profiles of drugs into our latent representation (Figure {@fig:entire_process}b). 9 | We used a manually-curated gold standard set of drug-disease medical indications [@doi:10.7554/eLife.26726; @doi:10.5281/zenodo.47664] for 322 drugs across 53 diseases to evaluate the prediction performance. 10 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | default_language_version: 2 | python: python3.10 3 | repos: 4 | - repo: https://github.com/pre-commit/pre-commit-hooks 5 | rev: v6.0.0 6 | hooks: 7 | # Check for files that contain merge conflict strings. 8 | - id: check-merge-conflict 9 | # Check for debugger imports and py37+ `breakpoint()` calls in python source. 10 | - id: debug-statements 11 | # Replaces or checks mixed line ending 12 | - id: mixed-line-ending 13 | # Check for files that would conflict in case-insensitive filesystems 14 | - id: check-case-conflict 15 | # This hook checks toml files for parseable syntax. 16 | - id: check-toml 17 | # This hook checks yaml files for parseable syntax. 
18 | - id: check-yaml 19 | - repo: https://github.com/charliermarsh/ruff-pre-commit 20 | rev: v0.14.8 21 | hooks: 22 | - id: ruff 23 | args: 24 | - --fix 25 | - repo: https://github.com/python/black 26 | rev: 25.12.0 27 | hooks: 28 | - id: black 29 | language_version: python3 30 | - repo: https://github.com/python-poetry/poetry 31 | rev: "2.2.1" 32 | hooks: 33 | - id: poetry-check 34 | - repo: https://github.com/tox-dev/pyproject-fmt 35 | rev: "v2.11.1" 36 | hooks: 37 | - id: pyproject-fmt 38 | - repo: https://github.com/rhysd/actionlint 39 | rev: v1.7.9 40 | hooks: 41 | - id: actionlint 42 | - repo: https://github.com/citation-file-format/cffconvert 43 | rev: b6045d78aac9e02b039703b030588d54d53262ac 44 | hooks: 45 | - id: validate-cff 46 | - repo: https://gitlab.com/vojko.pribudic.foss/pre-commit-update 47 | rev: v0.6.0 48 | hooks: 49 | - id: pre-commit-update 50 | args: ["--keep", "pre-commit-update", "--keep", "cffconvert"] 51 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier/50.00.supplementary_material.md: -------------------------------------------------------------------------------- 1 | ### Latent variables (gene modules) information 2 | 3 | #### LV603 4 | 5 | 6 | | Pathway | AUC | FDR | 7 | |:------------------------------------|:------|:---------| 8 | | IRIS Neutrophil-Resting | 0.91 | 4.51e-35 | 9 | | SVM Neutrophils | 0.98 | 1.43e-09 | 10 | | PID IL8CXCR2 PATHWAY | 0.81 | 7.04e-03 | 11 | | SIG PIP3 SIGNALING IN B LYMPHOCYTES | 0.77 | 1.95e-02 | 12 | 13 | Table: Pathways aligned to LV603 from the MultiPLIER models. 
{#tbl:sup:multiplier_pathways:lv603} 14 | 15 | 16 | 17 | | Trait description | Sample size | Cases | FDR | 18 | |:------------------------------------------|:--------------|:--------|:---------------| 19 | | Basophill percentage | 349,861 | | 1.19e‑10 | 20 | | Basophill count | 349,856 | | 1.89e‑05 | 21 | | Treatment/medication code: ispaghula husk | 361,141 | 327 | 1.36e‑02 | 22 | 23 | Table: Significant trait associations of LV603 in PhenomeXcan. {#tbl:sup:phenomexcan_assocs:lv603} 24 | 25 | 26 | 27 | | Phecode | Trait description | Sample size | Cases | FDR | 28 | |:----------------------------|:--------------------|:--------------|:--------|:------| 29 | | No significant associations | | | | | 30 | 31 | Table: Significant trait associations of LV603 in eMERGE. {#tbl:sup:emerge_assocs:lv603} 32 | 33 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/04.20.00.traits_clustering.md: -------------------------------------------------------------------------------- 1 | ### LVs reveal trait clusters with shared transcriptomic properties 2 | 3 | We used the projection of gene-trait associations into the latent space to find groups of clusters linked by the same transcriptional processes. 4 | Since individual clustering algorithms have different biases (i.e., assumptions about the data structure), we designed a consensus clustering framework that combines solutions or partitions of traits generated by different methods ([Methods](#sec:methods:clustering)). 5 | Consensus or ensemble approaches have been recommended to avoid several pitfalls when performing cluster analysis on biological data [@doi:10.1126/scisignal.aad1932]. 6 | Since diversity in the ensemble is crucial for these methods, we generated different data versions which were processed using different methods with varying sets of parameters (Figure {@fig:clustering:design}a). 
7 | Then, a consensus function combines the ensemble into a consolidated solution, which has been shown to outperform any individual member of the ensemble [@Strehl2002; @doi:10.1109/TPAMI.2005.113]. 8 | Our clustering pipeline generated 15 final consensus clustering solutions (Figure @fig:sup:clustering:agreement). 9 | The number of clusters of these partitions (between 5 and 29) was learned from the data by selecting the partitions with the largest agreement with the ensemble [@Strehl2002]. 10 | Instead of selecting one of these final solutions with a specific number of clusters, we used a clustering tree [@doi:10.1093/gigascience/giy083] (Figure @fig:clustering:tree) to examine stable groups of traits across multiple resolutions. 11 | To understand which latent variables differentiated the group of traits, we trained a decision tree classifier on the input data $\hat{\mathbf{M}}$ using the clusters found as labels (Figure {@fig:clustering:design}b, see [Methods](#sec:methods:clustering)). 12 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/prompt_gpt3_e2e/ai-revision-prompts.yaml: -------------------------------------------------------------------------------- 1 | prompts: 2 | front_matter: Revise the following paragraph to include the keyword "testify" somewhere in the text; the keyword must be present verbatim. 3 | abstract: Revise the following paragraph to include the keyword "orchestra" somewhere in the text; the keyword must be present verbatim. 4 | introduction: Revise the following paragraph to include the keyword "wound" somewhere in the text; the keyword must be present verbatim. 5 | results: Revise the following paragraph to include the keyword "classroom" somewhere in the text; the keyword must be present verbatim. 6 | results_framework: Revise the following paragraph to include the keyword "secretary" somewhere in the text; the keyword must be present verbatim. 
7 | crispr: Revise the following paragraph to include the keyword "army" somewhere in the text; the keyword must be present verbatim. 8 | drug_disease_prediction: Revise the following paragraph to include the keyword "breakdown" somewhere in the text; the keyword must be present verbatim. 9 | traits_clustering: Revise the following paragraph to include the keyword "siege" somewhere in the text; the keyword must be present verbatim. 10 | discussion: Revise the following paragraph to include the keyword "beer" somewhere in the text; the keyword must be present verbatim. 11 | methods: Revise the following paragraph to include the keyword "confront" somewhere in the text; the keyword must be present verbatim. 12 | references: Revise the following paragraph to include the keyword "disability" somewhere in the text; the keyword must be present verbatim. 13 | acknowledgements: Revise the following paragraph to include the keyword "stitch" somewhere in the text; the keyword must be present verbatim. 14 | supplementary_material: Revise the following paragraph to include the keyword "waiter" somewhere in the text; the keyword must be present verbatim. 15 | 16 | default: | 17 | This is the default prompt 18 | -------------------------------------------------------------------------------- /tests/manuscripts/mutator-epistasis/01.abstract.md: -------------------------------------------------------------------------------- 1 | ## Abstract {.page_break_before} 2 | 3 | Maintaining germline genome integrity is essential and enormously complex. 4 | Hundreds of proteins are involved in DNA replication and proofreading, and hundreds more are mobilized to repair DNA damage [@PMID:28485537]. 5 | While loss-of-function mutations in any of the genes encoding these proteins might lead to elevated mutation rates, *mutator alleles* have largely eluded detection in mammals. 
6 | 7 | DNA replication and repair proteins often recognize particular sequence motifs or excise lesions at specific nucleotides. 8 | Thus, we might expect that the spectrum of *de novo* mutations — that is, the frequency of each individual mutation type (C>T, A>G, etc.) — will differ between genomes that harbor either a mutator or wild-type allele at a given locus. 9 | Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene *Mutyh* that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs [@PMID:35545679;@PMID:33472028]. 10 | 11 | In this study we developed a new method, called "aggregate mutation spectrum distance," to detect alleles associated with mutation spectrum variation. 12 | By applying this approach to mutation data from the BXDs, we confirmed the presence of the germline mutator locus near *Mutyh* and discovered an additional C>A mutator locus on chromosome 6 that overlaps *Ogg1*, a DNA glycosylase involved in the same base-excision repair network as *Mutyh* [@PMID:17581577]. 13 | The effect of a chromosome 6 mutator allele depended on the presence of a mutator allele near *Mutyh*, and BXDs with mutator alleles at both loci had even greater numbers of C>A mutations than those with mutator alleles at either locus alone. 14 | Our new methods for analyzing mutation spectra reveal evidence of epistasis between germline mutator alleles, and may be applicable to mutation data from humans and other model organisms. 
15 | 16 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier/50.01.supplementary_material.md: -------------------------------------------------------------------------------- 1 | ### Latent variables (gene modules) information 2 | 3 | #### LV603 4 | 5 | 7 | | Pathway | AUC | FDR | 8 | |:------------------------------------|:------|:---------| 9 | | IRIS Neutrophil-Resting | 0.91 | 4.51e-35 | 10 | | SVM Neutrophils | 0.98 | 1.43e-09 | 11 | | PID IL8CXCR2 PATHWAY | 0.81 | 7.04e-03 | 12 | | SIG PIP3 SIGNALING IN B LYMPHOCYTES | 0.77 | 1.95e-02 | 13 | 14 | Table: Pathways aligned to LV603 from the MultiPLIER models. {#tbl:sup:multiplier_pathways:lv603} 15 | 16 | 17 | 21 | | Trait description | Sample size | Cases | FDR | 22 | |:------------------------------------------|:--------------|:--------|:---------------| 23 | | Basophill percentage | 349,861 | | 1.19e‑10 | 24 | | Basophill count | 349,856 | | 1.89e‑05 | 25 | | Treatment/medication code: ispaghula husk | 361,141 | 327 | 1.36e‑02 | 26 | 27 | Table: Significant trait associations of LV603 in PhenomeXcan. {#tbl:sup:phenomexcan_assocs:lv603} 28 | 32 | 33 | 39 | | Phecode | Trait description | Sample size | Cases | FDR | 40 | |:----------------------------|:--------------------|:--------------|:--------|:------| 41 | | No significant associations | | | | | 42 | 43 | Table: Significant trait associations of LV603 in eMERGE. {#tbl:sup:emerge_assocs:lv603} 44 | 45 | -------------------------------------------------------------------------------- /CITATION.cff: -------------------------------------------------------------------------------- 1 | # This CITATION.cff file was generated with cffinit. 2 | # Visit https://bit.ly/cffinit to generate yours today! 3 | --- 4 | cff-version: 1.2.0 5 | title: Manubot AI Editor 6 | message: >- 7 | If you use this work in some way, please cite both the article from 8 | preferred-citation and the software itself. 
These details can be 9 | found within the CITATION.cff file. 10 | type: software 11 | authors: 12 | - given-names: Milton 13 | family-names: Pividori 14 | orcid: "https://orcid.org/0000-0002-3035-4403" 15 | - given-names: Faisal 16 | family-names: Alquaddoomi 17 | orcid: "https://orcid.org/0000-0003-4297-8747" 18 | - given-names: Vincent 19 | family-names: Rubinetti 20 | orcid: "https://orcid.org/0000-0002-4655-3773" 21 | - given-names: Dave 22 | family-names: Bunten 23 | orcid: "https://orcid.org/0000-0001-6041-3665" 24 | - given-names: Casey 25 | family-names: Greene 26 | orcid: "https://orcid.org/0000-0001-8713-9213" 27 | repository-code: "https://github.com/manubot/manubot-ai-editor" 28 | abstract: | 29 | A tool for performing automatic, AI-assisted revisions of Manubot manuscripts. 30 | keywords: 31 | - manubot 32 | - AI 33 | - editor 34 | - manuscript 35 | - revision 36 | - research 37 | - large-language-models 38 | license: BSD-3-Clause 39 | identifiers: 40 | - description: Manuscript 41 | type: doi 42 | value: "10.1093/jamia/ocae139" 43 | - description: Software 44 | type: doi 45 | value: "10.5281/zenodo.14911573" 46 | preferred-citation: 47 | title: >- 48 | A publishing infrastructure for Artificial Intelligence (AI)-assisted academic authoring 49 | type: article 50 | url: https://academic.oup.com/jamia/article/31/9/2103/7693927 51 | authors: 52 | - given-names: Milton 53 | family-names: Pividori 54 | orcid: "https://orcid.org/0000-0002-3035-4403" 55 | - given-names: Casey S. 
56 | family-names: Greene 57 | orcid: "https://orcid.org/0000-0001-8713-9213" 58 | date-published: 2024-09-01 59 | identifiers: 60 | - type: doi 61 | value: 10.1093/jamia/ocae139 62 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc/metadata.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | title: "An efficient not-only-linear correlation coefficient based on machine learning" 3 | keywords: 4 | - correlation coefficient 5 | - nonlinear relationships 6 | - gene expression 7 | lang: en-US 8 | authors: 9 | - name: Milton Pividori 10 | github: miltondp 11 | initials: MP 12 | orcid: 0000-0002-3035-4403 13 | twitter: miltondp 14 | email: miltondp@gmail.com 15 | affiliations: 16 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 17 | funders: 18 | - The Gordon and Betty Moore Foundation GBMF 4552 19 | - The National Human Genome Research Institute (R01 HG010067) 20 | 21 | - name: Marylyn D. Ritchie 22 | initials: MDR 23 | orcid: 0000-0002-1208-1720 24 | twitter: MarylynRitchie 25 | email: marylyn@pennmedicine.upenn.edu 26 | affiliations: 27 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 28 | 29 | - name: Diego H. Milone 30 | github: dmilone 31 | initials: DHM 32 | orcid: 0000-0003-2182-4351 33 | twitter: d1001 34 | email: dmilone@sinc.unl.edu.ar 35 | affiliations: 36 | - Research Institute for Signals, Systems and Computational Intelligence (sinc(i)), Universidad Nacional del Litoral, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Santa Fe CP3000, Argentina 37 | 38 | - name: Casey S. 
Greene 39 | github: cgreene 40 | initials: CSG 41 | orcid: 0000-0001-8713-9213 42 | twitter: GreeneScientist 43 | email: casey.s.greene@cuanschutz.edu 44 | affiliations: 45 | - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA 46 | - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA 47 | funders: 48 | - The Gordon and Betty Moore Foundation (GBMF 4552) 49 | - The National Human Genome Research Institute (R01 HG010067) 50 | - The National Cancer Institute (R01 CA237170) 51 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc_non_standard_filenames/metadata.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | title: "An efficient not-only-linear correlation coefficient based on machine learning" 3 | keywords: 4 | - correlation coefficient 5 | - nonlinear relationships 6 | - gene expression 7 | lang: en-US 8 | authors: 9 | - name: Milton Pividori 10 | github: miltondp 11 | initials: MP 12 | orcid: 0000-0002-3035-4403 13 | twitter: miltondp 14 | email: miltondp@gmail.com 15 | affiliations: 16 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 17 | funders: 18 | - The Gordon and Betty Moore Foundation GBMF 4552 19 | - The National Human Genome Research Institute (R01 HG010067) 20 | 21 | - name: Marylyn D. Ritchie 22 | initials: MDR 23 | orcid: 0000-0002-1208-1720 24 | twitter: MarylynRitchie 25 | email: marylyn@pennmedicine.upenn.edu 26 | affiliations: 27 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 28 | 29 | - name: Diego H. 
Milone 30 | github: dmilone 31 | initials: DHM 32 | orcid: 0000-0003-2182-4351 33 | twitter: d1001 34 | email: dmilone@sinc.unl.edu.ar 35 | affiliations: 36 | - Research Institute for Signals, Systems and Computational Intelligence (sinc(i)), Universidad Nacional del Litoral, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Santa Fe CP3000, Argentina 37 | 38 | - name: Casey S. Greene 39 | github: cgreene 40 | initials: CSG 41 | orcid: 0000-0001-8713-9213 42 | twitter: GreeneScientist 43 | email: casey.s.greene@cuanschutz.edu 44 | affiliations: 45 | - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA 46 | - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA 47 | funders: 48 | - The Gordon and Betty Moore Foundation (GBMF 4552) 49 | - The National Human Genome Research Institute (R01 HG010067) 50 | - The National Cancer Institute (R01 CA237170) 51 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/only_revision_prompts/ai-revision-prompts.yaml: -------------------------------------------------------------------------------- 1 | prompts_files: 2 | abstract: | 3 | Revise the following paragraph from the Abstract of an academic paper (with the title '{title}' and keywords '{keywords}') so 4 | the research problem/question is clear, 5 | the solution proposed is clear, 6 | the text grammar is correct, 7 | spelling errors are fixed, 8 | and the text is in active voice and has a clear sentence structure 9 | 10 | introduction|discussion: | 11 | Revise the following paragraph from the {section_name} of an academic paper (with the title '{title}' and keywords '{keywords}') so 12 | the research problem/question is clear, 13 | the solution proposed is clear, 14 | the text grammar is correct, 15 | spelling errors are fixed, 16 | and the text is in active voice and has a clear sentence structure 17 
| 18 | results: | 19 | Revise the following paragraph from the Results section of an academic paper (with the title '{title}' and keywords '{keywords}') so 20 | most references to figures and tables are kept, 21 | the details are enough to clearly explain the outcomes, 22 | sentences are concise and to the point, 23 | the text minimizes the use of jargon, 24 | the text grammar is correct, 25 | spelling errors are fixed, 26 | and the text has a clear sentence structure 27 | 28 | methods: | 29 | Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{title}' and keywords '{keywords}') so 30 | most of the citations to other academic papers are kept, 31 | most of the technical details are kept, 32 | most references to equations (such as "Equation (@id)") are kept, 33 | all equations definitions (such as '*equation_definition') are included with newlines before and after, 34 | the most important symbols in equations are defined, 35 | the text grammar is correct, 36 | spelling errors are fixed, 37 | and the text has a clear sentence structure 38 | 39 | references: null 40 | 41 | \.md$: | 42 | Proofread the following paragraph 43 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/both_prompts_config/ai-revision-prompts.yaml: -------------------------------------------------------------------------------- 1 | prompts: 2 | abstract: | 3 | Revise the following paragraph from the Abstract of an academic paper (with the title '{title}' and keywords '{keywords}') so 4 | the research problem/question is clear, 5 | the solution proposed is clear, 6 | the text grammar is correct, 7 | spelling errors are fixed, 8 | and the text is in active voice and has a clear sentence structure 9 | 10 | introduction_discussion: | 11 | Revise the following paragraph from the {section_name} of an academic paper (with the title '{title}' and keywords '{keywords}') so 12 | the research 
problem/question is clear, 13 | the solution proposed is clear, 14 | the text grammar is correct, 15 | spelling errors are fixed, 16 | and the text is in active voice and has a clear sentence structure 17 | 18 | results: | 19 | Revise the following paragraph from the Results section of an academic paper (with the title '{title}' and keywords '{keywords}') so 20 | most references to figures and tables are kept, 21 | the details are enough to clearly explain the outcomes, 22 | sentences are concise and to the point, 23 | the text minimizes the use of jargon, 24 | the text grammar is correct, 25 | spelling errors are fixed, 26 | and the text has a clear sentence structure 27 | 28 | methods: | 29 | Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{title}' and keywords '{keywords}') so 30 | most of the citations to other academic papers are kept, 31 | most of the technical details are kept, 32 | most references to equations (such as "Equation (@id)") are kept, 33 | all equations definitions (such as '*equation_definition') are included with newlines before and after, 34 | the most important symbols in equations are defined, 35 | the text grammar is correct, 36 | spelling errors are fixed, 37 | and the text has a clear sentence structure 38 | 39 | default: | 40 | Proofread the following paragraph (with the title '{title}' and keywords '{keywords}') 41 | -------------------------------------------------------------------------------- /tests/config_loader_fixtures/conflicting_promptsfiles_matchings/ai-revision-prompts.yaml: -------------------------------------------------------------------------------- 1 | prompts_files: 2 | abstract: | 3 | Revise the following paragraph from the Abstract of an academic paper (with the title '{title}' and keywords '{keywords}') so 4 | the research problem/question is clear, 5 | the solution proposed is clear, 6 | the text grammar is correct, 7 | spelling errors are fixed, 8 | and the text is in 
active voice and has a clear sentence structure 9 | 10 | introduction|discussion: | 11 | Revise the following paragraph from the {section_name} of an academic paper (with the title '{title}' and keywords '{keywords}') so 12 | the research problem/question is clear, 13 | the solution proposed is clear, 14 | the text grammar is correct, 15 | spelling errors are fixed, 16 | and the text is in active voice and has a clear sentence structure 17 | 18 | results: | 19 | Revise the following paragraph from the Results section of an academic paper (with the title '{title}' and keywords '{keywords}') so 20 | most references to figures and tables are kept, 21 | the details are enough to clearly explain the outcomes, 22 | sentences are concise and to the point, 23 | the text minimizes the use of jargon, 24 | the text grammar is correct, 25 | spelling errors are fixed, 26 | and the text has a clear sentence structure 27 | 28 | methods: | 29 | Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{title}' and keywords '{keywords}') so 30 | most of the citations to other academic papers are kept, 31 | most of the technical details are kept, 32 | most references to equations (such as "Equation (@id)") are kept, 33 | all equations definitions (such as '*equation_definition') are included with newlines before and after, 34 | the most important symbols in equations are defined, 35 | the text grammar is correct, 36 | spelling errors are fixed, 37 | and the text has a clear sentence structure 38 | 39 | references: null 40 | 41 | default: This is the default prompt 42 | 43 | \.md$: | 44 | Proofread the following paragraph 45 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc/00.front-matter.md: -------------------------------------------------------------------------------- 1 | {## 2 | This file contains a Jinja2 front-matter template that adds version and authorship information. 
3 | Changing the Jinja2 templates in this file may cause incompatibility with Manubot updates. 4 | Pandoc automatically inserts title from metadata.yaml, so it is not included in this template. 5 | ##} 6 | 7 | _A DOI-citable version of this manuscript is available at _. 8 | 9 | 23 | 24 | ## Authors 25 | 26 | {## Template for listing authors ##} 27 | {% for author in manubot.authors %} 28 | + **{{author.name}}**
29 | {%- if author.orcid is defined and author.orcid is not none %} 30 | ![ORCID icon](images/orcid.svg){.inline_icon width=16 height=16} 31 | [{{author.orcid}}](https://orcid.org/{{author.orcid}}) 32 | {%- endif %} 33 | {%- if author.github is defined and author.github is not none %} 34 | · ![GitHub icon](images/github.svg){.inline_icon width=16 height=16} 35 | [{{author.github}}](https://github.com/{{author.github}}) 36 | {%- endif %} 37 | {%- if author.twitter is defined and author.twitter is not none %} 38 | · ![Twitter icon](images/twitter.svg){.inline_icon width=16 height=16} 39 | [{{author.twitter}}](https://twitter.com/{{author.twitter}}) 40 | {%- endif %}
41 | 42 | {%- if author.affiliations is defined and author.affiliations|length %} 43 | {{author.affiliations | join('; ')}} 44 | {%- endif %} 45 | {%- if author.funders is defined and author.funders|length %} 46 | · Funded by {{author.funders | join('; ')}} 47 | {%- endif %} 48 | 49 | {% endfor %} 50 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc_non_standard_filenames/00.front-matter.md: -------------------------------------------------------------------------------- 1 | {## 2 | This file contains a Jinja2 front-matter template that adds version and authorship information. 3 | Changing the Jinja2 templates in this file may cause incompatibility with Manubot updates. 4 | Pandoc automatically inserts title from metadata.yaml, so it is not included in this template. 5 | ##} 6 | 7 | _A DOI-citable version of this manuscript is available at _. 8 | 9 | 23 | 24 | ## Authors 25 | 26 | {## Template for listing authors ##} 27 | {% for author in manubot.authors %} 28 | + **{{author.name}}**
29 | {%- if author.orcid is defined and author.orcid is not none %} 30 | ![ORCID icon](images/orcid.svg){.inline_icon width=16 height=16} 31 | [{{author.orcid}}](https://orcid.org/{{author.orcid}}) 32 | {%- endif %} 33 | {%- if author.github is defined and author.github is not none %} 34 | · ![GitHub icon](images/github.svg){.inline_icon width=16 height=16} 35 | [{{author.github}}](https://github.com/{{author.github}}) 36 | {%- endif %} 37 | {%- if author.twitter is defined and author.twitter is not none %} 38 | · ![Twitter icon](images/twitter.svg){.inline_icon width=16 height=16} 39 | [{{author.twitter}}](https://twitter.com/{{author.twitter}}) 40 | {%- endif %}
41 | 42 | {%- if author.affiliations is defined and author.affiliations|length %} 43 | {{author.affiliations | join('; ')}} 44 | {%- endif %} 45 | {%- if author.funders is defined and author.funders|length %} 46 | · Funded by {{author.funders | join('; ')}} 47 | {%- endif %} 48 | 49 | {% endfor %} 50 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | .idea/ -------------------------------------------------------------------------------- /tests/manuscripts/ccc/08.15.methods.giant.md: -------------------------------------------------------------------------------- 1 | ### Tissue-specific network analyses using GIANT {#sec:giant} 2 | 3 | We accessed tissue-specific gene networks of GIANT using both the web interface and web services provided by HumanBase [@url:https://hb.flatironinstitute.org]. 4 | The GIANT version used in this study included 987 genome-scale datasets with approximately 38,000 conditions from around 14,000 publications. 5 | Details on how these networks were built are described in [@doi:10.1038/ng.3259]. 6 | Briefly, tissue-specific gene networks were built using gene expression data (without GTEx samples [@url:https://hb.flatironinstitute.org/data]) from the NCBI's Gene Expression Omnibus (GEO) [@doi:10.1093/nar/gks1193], protein-protein interaction (BioGRID [@pmc:PMC3531226], IntAct [@doi:10.1093/nar/gkr1088], MINT [@doi:10.1093/nar/gkr930] and MIPS [@pmc:PMC148093]), transcription factor regulation using binding motifs from JASPAR [@doi:10.1093/nar/gkp950], and chemical and genetic perturbations from MSigDB [@doi:10.1073/pnas.0506580102]. 
7 | Gene expression data were log-transformed, and the Pearson correlation was computed for each gene pair, normalized using the Fisher's z transform, and z-scores discretized into different bins. 8 | Gold standards for tissue-specific functional relationships were built using expert curation and experimentally derived gene annotations from the Gene Ontology. 9 | Then, one naive Bayesian classifier (using C++ implementations from the Sleipnir library [@pmid:18499696]) for each of the 144 tissues was trained using these gold standards. 10 | Finally, these classifiers were used to estimate the probability of tissue-specific interactions for each gene pair. 11 | 12 | 13 | For each pair of genes prioritized in our study using GTEx, we used GIANT through HumanBase to obtain 14 | 1) a predicted gene network for blood (manually selected to match whole blood in GTEx) and 15 | 2) a gene network with an automatically predicted tissue using the method described in [@doi:10.1101/gr.155697.113] and provided by HumanBase web interfaces/services. 16 | Briefly, the tissue prediction approach trains a machine learning model using comprehensive transcriptional data with human-curated markers of different cell lineages (e.g., macrophages) as gold standards. 17 | Then, these models are used to predict other cell lineage-specific genes. 18 | In addition to reporting this predicted tissue or cell lineage, we computed the average probability of interaction between all genes in the network retrieved from GIANT. 19 | Following the default procedure used in GIANT, we included the top 15 genes with the highest probability of interaction with the queried gene pair for each network. 
20 | -------------------------------------------------------------------------------- /tests/provider_fixtures/provider_model_engines.json: -------------------------------------------------------------------------------- 1 | { 2 | "OpenAIProvider": [ 3 | "gpt-4o-audio-preview-2024-12-17", 4 | "dall-e-3", 5 | "text-embedding-3-large", 6 | "dall-e-2", 7 | "o4-mini-2025-04-16", 8 | "gpt-4o-audio-preview-2024-10-01", 9 | "o4-mini", 10 | "gpt-4.1-nano", 11 | "o1-2024-12-17", 12 | "gpt-4.1-nano-2025-04-14", 13 | "gpt-4o-realtime-preview-2024-10-01", 14 | "o1", 15 | "gpt-4o-realtime-preview", 16 | "babbage-002", 17 | "gpt-4-turbo-preview", 18 | "o1-pro", 19 | "o1-pro-2025-03-19", 20 | "tts-1-hd-1106", 21 | "gpt-4-0125-preview", 22 | "gpt-4", 23 | "text-embedding-ada-002", 24 | "o3-2025-04-16", 25 | "tts-1-hd", 26 | "gpt-4o-mini-audio-preview", 27 | "gpt-4o-audio-preview", 28 | "o1-preview-2024-09-12", 29 | "o3", 30 | "gpt-4o-mini-realtime-preview", 31 | "gpt-4.1-mini", 32 | "gpt-4o-mini-realtime-preview-2024-12-17", 33 | "gpt-3.5-turbo-instruct-0914", 34 | "gpt-4o-mini-search-preview", 35 | "gpt-4.1-mini-2025-04-14", 36 | "tts-1-1106", 37 | "chatgpt-4o-latest", 38 | "davinci-002", 39 | "gpt-3.5-turbo-1106", 40 | "gpt-4o-search-preview", 41 | "gpt-4-turbo", 42 | "gpt-4o-realtime-preview-2024-12-17", 43 | "gpt-3.5-turbo-instruct", 44 | "gpt-3.5-turbo", 45 | "gpt-4-1106-preview", 46 | "gpt-4o-mini-search-preview-2025-03-11", 47 | "gpt-4o-2024-11-20", 48 | "whisper-1", 49 | "gpt-4o-2024-05-13", 50 | "gpt-4-turbo-2024-04-09", 51 | "gpt-3.5-turbo-16k", 52 | "o1-preview", 53 | "gpt-4-0613", 54 | "computer-use-preview-2025-03-11", 55 | "computer-use-preview", 56 | "gpt-4.5-preview", 57 | "gpt-4.5-preview-2025-02-27", 58 | "gpt-4o-search-preview-2025-03-11", 59 | "tts-1", 60 | "omni-moderation-2024-09-26", 61 | "text-embedding-3-small", 62 | "gpt-4o-mini-tts", 63 | "gpt-4o", 64 | "o3-mini", 65 | "o3-mini-2025-01-31", 66 | "gpt-4o-mini", 67 | "gpt-4o-2024-08-06", 68 | "gpt-4.1", 69 
| "gpt-4o-transcribe", 70 | "gpt-4.1-2025-04-14", 71 | "gpt-4o-mini-2024-07-18", 72 | "gpt-4o-mini-transcribe", 73 | "o1-mini", 74 | "gpt-4o-mini-audio-preview-2024-12-17", 75 | "gpt-3.5-turbo-0125", 76 | "o1-mini-2024-09-12", 77 | "omni-moderation-latest" 78 | ], 79 | "AnthropicProvider": [ 80 | "claude-3-7-sonnet-20250219", 81 | "claude-3-5-sonnet-20241022", 82 | "claude-3-5-haiku-20241022", 83 | "claude-3-5-sonnet-20240620", 84 | "claude-3-haiku-20240307", 85 | "claude-3-opus-20240229", 86 | "claude-3-sonnet-20240229", 87 | "claude-2.1", 88 | "claude-2.0" 89 | ], 90 | "__generated_on__": "2025-04-17T11:20:13.938509" 91 | } -------------------------------------------------------------------------------- /tests/provider_fixtures/refresh_model_engines.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | This command persists model engines for each provider to MODEL_PROVIDER_JSON. 5 | This list is used in non-live tests so that we don't need users who run the test 6 | suite to provide valid API keys for every provider. 7 | """ 8 | 9 | from datetime import datetime 10 | import json 11 | import os 12 | from pathlib import Path 13 | from manubot_ai_editor.model_providers import MODEL_PROVIDERS 14 | 15 | MODEL_PROVIDER_JSON = "provider_model_engines.json" 16 | 17 | 18 | def persist_provider_model_engines(): 19 | """ 20 | Persists the default model engines for each provider to a JSON file, 21 | distributed with the package. This method requires valid API keys to be 22 | available for each provider, since the list of models per provider is pulled 23 | from the API. 24 | 25 | The JSON file is used as a fallback in case the API is for some reason 26 | unavailable, e.g. in testing when we don't have valid keys. 27 | 28 | It's unfortunate that the model providers require authenticated access to 29 | get the list of models, but that's how it is. 
Ideally we'd run this with 30 | each release, to at least capture which model engines are available at the 31 | time of release. 32 | """ 33 | 34 | with Path(MODEL_PROVIDER_JSON).open("w") as f: 35 | model_list = { 36 | provider.__class__.__name__: provider.get_models() 37 | for provider in MODEL_PROVIDERS.values() 38 | if not provider.is_local_provider() 39 | } 40 | model_list["__generated_on__"] = datetime.now().isoformat() 41 | 42 | json.dump(model_list, f, indent=2) 43 | 44 | return model_list 45 | 46 | 47 | def retrieve_provider_model_engines(): 48 | """ 49 | Pulls the default model engines for each provider from the JSON file 50 | distributed with the package. 51 | """ 52 | 53 | with Path(MODEL_PROVIDER_JSON).open("r") as f: 54 | return json.load(f) 55 | 56 | 57 | def main(): 58 | # check that an API key is set for each non-local provider 59 | for provider in [x for x in MODEL_PROVIDERS.values() if not x.is_local_provider()]: 60 | provider_key_var = provider.api_key_env_var() 61 | 62 | if not os.environ.get(provider_key_var): 63 | raise ValueError( 64 | f"Provider {provider.__class__.__name__} requires an API key in" 65 | f" env var {provider_key_var}, but none is set." 66 | ) 67 | 68 | # persist the provider model engines to a JSON file 69 | new_list = persist_provider_model_engines() 70 | 71 | print( 72 | f"Persisted {sum(len(x) for x in new_list.values() if isinstance(x, list))} model engines" 73 | f" to {MODEL_PROVIDER_JSON}" 74 | ) 75 | 76 | 77 | if __name__ == "__main__": 78 | main() 79 | -------------------------------------------------------------------------------- /tests/conftest.py: -------------------------------------------------------------------------------- 1 | """ 2 | Configures 'cost' marker for tests that cost money (i.e. OpenAI API credits) 3 | to run. 
4 | 5 | Adapted from https://docs.pytest.org/en/latest/example/simple.html#control-skipping-of-tests-according-to-command-line-option 6 | """ 7 | 8 | import json 9 | import pytest 10 | 11 | from pathlib import Path 12 | from unittest import mock 13 | 14 | 15 | def pytest_addoption(parser): 16 | parser.addoption( 17 | "--runcost", 18 | action="store_true", 19 | default=False, 20 | help="run tests that can incur API usage costs", 21 | ) 22 | 23 | 24 | def pytest_configure(config): 25 | config.addinivalue_line( 26 | "markers", "cost: mark test as possibly costing money to run" 27 | ) 28 | config.addinivalue_line( 29 | "markers", 30 | "mocked_model_list: mark test as having used the provider's cached model list", 31 | ) 32 | 33 | 34 | def pytest_collection_modifyitems(config, items): 35 | if config.getoption("--runcost"): 36 | # --runcost given in cli: do not skip cost tests 37 | return 38 | 39 | skip_cost = pytest.mark.skip(reason="need --runcost option to run") 40 | 41 | for item in items: 42 | if "cost" in item.keywords: 43 | item.add_marker(skip_cost) 44 | 45 | 46 | # since we don't have valid provider API keys during tests, we can't get the 47 | # model list from the provider API. 
instead, we mock the method that retrieves 48 | # the model list to return a cached version of the model list stored in the 49 | # provider_model_engines.json file 50 | @pytest.fixture(autouse=True, scope="function") 51 | def patch_model_list_cache(request): 52 | # skip patching if the test or anything above it is marked with 'cost', 53 | # which implies we have a valid API key and thus should retrieve the model 54 | # list from the provider API 55 | if request.node.get_closest_marker("cost") is not None: 56 | yield 57 | return 58 | 59 | # path to the provider_model_engines.json file 60 | provider_model_engine_json = ( 61 | Path(__file__).parent / "provider_fixtures" / "provider_model_engines.json" 62 | ) 63 | 64 | # load the provider model list once, then use it in our mocked method 65 | with provider_model_engine_json.open("r") as f: 66 | provider_model_engines = json.load(f) 67 | 68 | @classmethod 69 | def cached_model_list_retriever(cls): 70 | # annotate the request object to indicate we're using a mocked method 71 | # we want the live API tests to ensure they're not using this mock 72 | request.node.add_marker("mocked_model_list") 73 | 74 | return provider_model_engines[cls.__name__] 75 | 76 | # finally, apply the mock 77 | with mock.patch( 78 | "manubot_ai_editor.model_providers.BaseModelProvider.get_models", 79 | new=cached_model_list_retriever, 80 | ) as mock_method: 81 | yield mock_method 82 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full/content/15.acknowledgements.md: -------------------------------------------------------------------------------- 1 | ## Acknowledgements 2 | 3 | This study was funded by: 4 | the Gordon and Betty Moore Foundation (GBMF 4552 to C.S. Greene; GBMF 4560 to B.D. Sullivan), 5 | the National Human Genome Research Institute (R01 HG010067 to C.S. Greene, S.F.A. Grant and B.D. Sullivan; K99 HG011898 and R00 HG011898 to M. Pividori; U01 HG011181 to W. 
Wei), 6 | the National Cancer Institute (R01 CA237170 to C.S. Greene), 7 | the Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765 to C.S. Greene), 8 | the National Institute on Aging (R01 AG069900 to W. Wei), 9 | the National Institute of General Medical Sciences (R01 GM139891 to W. Wei); 10 | the National Heart, Lung, and Blood Institute (R01 HL163854 to Q. Feng); 11 | the National Institute of Diabetes and Digestive and Kidney Diseases (DK126194 to B.F. Voight); 12 | the Daniel B. Burke Endowed Chair for Diabetes Research to S.F.A. Grant; 13 | the Robert L. McNeil Jr. Endowed Fellowship in Translational Medicine and Therapeutics to C. Skarke. 14 | 15 | The Phase III of the eMERGE Network was initiated and funded by the NHGRI through the following grants: 16 | U01 HG8657 (Group Health Cooperative/University of Washington); 17 | U01 HG8685 (Brigham and Women's Hospital); 18 | U01 HG8672 (Vanderbilt University Medical Center); 19 | U01 HG8666 (Cincinnati Children's Hospital Medical Center); 20 | U01 HG6379 (Mayo Clinic); 21 | U01 HG8679 (Geisinger Clinic); 22 | U01 HG8680 (Columbia University Health Sciences); 23 | U01 HG8684 (Children's Hospital of Philadelphia); 24 | U01 HG8673 (Northwestern University); 25 | U01 HG8701 (Vanderbilt University Medical Center serving as the Coordinating Center); 26 | U01 HG8676 (Partners Healthcare/Broad Institute); 27 | and U01 HG8664 (Baylor College of Medicine). 28 | 29 | The Penn Medicine BioBank (PMBB) is funded by the Perelman School of Medicine at the University of Pennsylvania, a gift from the Smilow family, and the National Center for Advancing Translational Sciences of the National Institutes of Health under CTSA Award Number UL1TR001878. 30 | We thank D. Birtwell, H. Williams, P. Baumann and M. Risman for informatics support regarding the PMBB. 31 | We thank the staff of the Regeneron Genetics Center for whole-exome sequencing of DNA from PMBB participants. 
32 | 33 | Figure {@fig:entire_process}a was created with BioRender.com. 34 | 35 | 36 | ## Author contributions statement 37 | 38 | M. Pividori and C.S. Greene conceived and designed the study. 39 | M. Pividori designed the computational methods, performed the experiments, analyzed the data, interpreted the results, and drafted the manuscript. 40 | C.S. Greene supervised the entire project and provided critical guidance throughout the study. 41 | S. Lu, C. Su, and M.E. Johnson performed the CRISPR screen with the supervision of S.F.A. Grant. 42 | B. Li provided the TWAS results for eMERGE for replication, and this analysis was supervised by M.D. Ritchie. 43 | W. Wei, Q. Feng, B. Namjou, K. Kiryluk, I. Kullo, Y. Luo, and M.D. Ritchie, as part of the eMERGE consortium, provided critical feedback regarding the analyses of this data. 44 | All authors revised the manuscript and provided critical feedback. 45 | 46 | ## Competing interests statement 47 | 48 | The authors declare no competing interests. 49 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full/content/00.front-matter.md: -------------------------------------------------------------------------------- 1 | {## 2 | This file contains a Jinja2 front-matter template that adds version and authorship information. 3 | Changing the Jinja2 templates in this file may cause incompatibility with Manubot updates. 4 | Pandoc automatically inserts title from metadata.yaml, so it is not included in this template. 5 | ##} 6 | 7 | _A DOI-citable version of this manuscript is available at
_ 8 | 9 | 23 | 24 | {% if manubot.date_long != manubot.generated_date_long -%} 25 | Published: {{manubot.date_long}} 26 | {% endif %} 27 | 28 | ## Authors 29 | 30 | {## Template for listing authors ##} 31 | {% for author in manubot.authors %} 32 | + **{{author.name}}** 33 | {% if author.corresponding is defined and author.corresponding == true -%}^[✉](#correspondence)^{%- endif -%} 34 |
35 | {%- set has_ids = false %} 36 | {%- if author.orcid is defined and author.orcid is not none %} 37 | {%- set has_ids = true %} 38 | ![ORCID icon](images/orcid.svg){.inline_icon width=16 height=16} 39 | [{{author.orcid}}](https://orcid.org/{{author.orcid}}) 40 | {%- endif %} 41 | {%- if author.github is defined and author.github is not none %} 42 | {%- set has_ids = true %} 43 | · ![GitHub icon](images/github.svg){.inline_icon width=16 height=16} 44 | [{{author.github}}](https://github.com/{{author.github}}) 45 | {%- endif %} 46 | {%- if author.twitter is defined and author.twitter is not none %} 47 | {%- set has_ids = true %} 48 | · ![Twitter icon](images/twitter.svg){.inline_icon width=16 height=16} 49 | [{{author.twitter}}](https://twitter.com/{{author.twitter}}) 50 | {%- endif %} 51 | {%- if author.mastodon is defined and author.mastodon is not none and author["mastodon-server"] is defined and author["mastodon-server"] is not none %} 52 | {%- set has_ids = true %} 53 | · ![Mastodon icon](images/mastodon.svg){.inline_icon width=16 height=16} 54 | [\@{{author.mastodon}}@{{author["mastodon-server"]}}](https://{{author["mastodon-server"]}}/@{{author.mastodon}}) 55 | {%- endif %} 56 | {%- if has_ids %} 57 |
58 | {%- endif %} 59 | 60 | {%- if author.affiliations is defined and author.affiliations|length %} 61 | {{author.affiliations | join('; ')}} 62 | {%- endif %} 63 | {%- if author.funders is defined and author.funders|length %} 64 | · Funded by {{author.funders | join('; ')}} 65 | {%- endif %} 66 | 67 | {% endfor %} 68 | 69 | ::: {#correspondence} 70 | ✉ — Correspondence possible via {% if manubot.ci_source is defined -%}[GitHub Issues](https://github.com/{{manubot.ci_source.repo_slug}}/issues){% else %}GitHub Issues{% endif %} 71 | {% if manubot.authors|map(attribute='corresponding')|select|max -%} 72 | or email to 73 | {% for author in manubot.authors|selectattr("corresponding") -%} 74 | {{ author.name }} \<{{ author.email }}\>{{ ", " if not loop.last else "." }} 75 | {% endfor %} 76 | {% endif %} 77 | ::: 78 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/00.front-matter.md: -------------------------------------------------------------------------------- 1 | {## 2 | This file contains a Jinja2 front-matter template that adds version and authorship information. 3 | Changing the Jinja2 templates in this file may cause incompatibility with Manubot updates. 4 | Pandoc automatically inserts title from metadata.yaml, so it is not included in this template. 5 | ##} 6 | 7 | _A DOI-citable version of this manuscript is available at
_ 8 | 9 | 23 | 24 | {% if manubot.date_long != manubot.generated_date_long -%} 25 | Published: {{manubot.date_long}} 26 | {% endif %} 27 | 28 | ## Authors 29 | 30 | {## Template for listing authors ##} 31 | {% for author in manubot.authors %} 32 | + **{{author.name}}** 33 | {% if author.corresponding is defined and author.corresponding == true -%}^[✉](#correspondence)^{%- endif -%} 34 |
35 | {%- set has_ids = false %} 36 | {%- if author.orcid is defined and author.orcid is not none %} 37 | {%- set has_ids = true %} 38 | ![ORCID icon](images/orcid.svg){.inline_icon width=16 height=16} 39 | [{{author.orcid}}](https://orcid.org/{{author.orcid}}) 40 | {%- endif %} 41 | {%- if author.github is defined and author.github is not none %} 42 | {%- set has_ids = true %} 43 | · ![GitHub icon](images/github.svg){.inline_icon width=16 height=16} 44 | [{{author.github}}](https://github.com/{{author.github}}) 45 | {%- endif %} 46 | {%- if author.twitter is defined and author.twitter is not none %} 47 | {%- set has_ids = true %} 48 | · ![Twitter icon](images/twitter.svg){.inline_icon width=16 height=16} 49 | [{{author.twitter}}](https://twitter.com/{{author.twitter}}) 50 | {%- endif %} 51 | {%- if author.mastodon is defined and author.mastodon is not none and author["mastodon-server"] is defined and author["mastodon-server"] is not none %} 52 | {%- set has_ids = true %} 53 | · ![Mastodon icon](images/mastodon.svg){.inline_icon width=16 height=16} 54 | [\@{{author.mastodon}}@{{author["mastodon-server"]}}](https://{{author["mastodon-server"]}}/@{{author.mastodon}}) 55 | {%- endif %} 56 | {%- if has_ids %} 57 |
58 | {%- endif %} 59 | 60 | {%- if author.affiliations is defined and author.affiliations|length %} 61 | {{author.affiliations | join('; ')}} 62 | {%- endif %} 63 | {%- if author.funders is defined and author.funders|length %} 64 | · Funded by {{author.funders | join('; ')}} 65 | {%- endif %} 66 | 67 | {% endfor %} 68 | 69 | ::: {#correspondence} 70 | ✉ — Correspondence possible via {% if manubot.ci_source is defined -%}[GitHub Issues](https://github.com/{{manubot.ci_source.repo_slug}}/issues){% else %}GitHub Issues{% endif %} 71 | {% if manubot.authors|map(attribute='corresponding')|select|max -%} 72 | or email to 73 | {% for author in manubot.authors|selectattr("corresponding") -%} 74 | {{ author.name }} \<{{ author.email }}\>{{ ", " if not loop.last else "." }} 75 | {% endfor %} 76 | {% endif %} 77 | ::: 78 | -------------------------------------------------------------------------------- /tests/manuscripts/mutator-epistasis/00.front-matter.md: -------------------------------------------------------------------------------- 1 | {## 2 | This file contains a Jinja2 front-matter template that adds version and authorship information. 3 | Changing the Jinja2 templates in this file may cause incompatibility with Manubot updates. 4 | Pandoc automatically inserts title from metadata.yaml, so it is not included in this template. 5 | ##} 6 | 7 | {## Uncomment & edit the following line to reference to a preprinted or published version of the manuscript. 8 | _A DOI-citable version of this manuscript is available at _. 
9 | ##} 10 | 11 | {## Template to insert build date and source ##} 12 | 13 | This manuscript 14 | {% if manubot.ci_source is defined and manubot.ci_source.provider == "appveyor" -%} 15 | ([permalink]({{manubot.ci_source.artifact_url}})) 16 | {% elif manubot.html_url_versioned is defined -%} 17 | ([permalink]({{manubot.html_url_versioned}})) 18 | {% endif -%} 19 | was automatically generated 20 | {% if manubot.ci_source is defined -%} 21 | from [{{manubot.ci_source.repo_slug}}@{{manubot.ci_source.commit | truncate(length=7, end='', leeway=0)}}](https://github.com/{{manubot.ci_source.repo_slug}}/tree/{{manubot.ci_source.commit}}) 22 | {% endif -%} 23 | on {{manubot.generated_date_long}}. 24 | 25 | 26 | {% if manubot.date_long != manubot.generated_date_long -%} 27 | Published: {{manubot.date_long}} 28 | {% endif %} 29 | 30 | ## Authors 31 | 32 | {## Template for listing authors ##} 33 | {% for author in manubot.authors %} 34 | + **{{author.name}}** 35 | {% if author.corresponding is defined and author.corresponding == true -%}^[✉](#correspondence)^{%- endif -%} 36 |
37 | {%- set has_ids = false %} 38 | {%- if author.orcid is defined and author.orcid is not none %} 39 | {%- set has_ids = true %} 40 | ![ORCID icon](images/orcid.svg){.inline_icon width=16 height=16} 41 | [{{author.orcid}}](https://orcid.org/{{author.orcid}}) 42 | {%- endif %} 43 | {%- if author.github is defined and author.github is not none %} 44 | {%- set has_ids = true %} 45 | · ![GitHub icon](images/github.svg){.inline_icon width=16 height=16} 46 | [{{author.github}}](https://github.com/{{author.github}}) 47 | {%- endif %} 48 | {%- if author.twitter is defined and author.twitter is not none %} 49 | {%- set has_ids = true %} 50 | · ![Twitter icon](images/twitter.svg){.inline_icon width=16 height=16} 51 | [{{author.twitter}}](https://twitter.com/{{author.twitter}}) 52 | {%- endif %} 53 | {%- if author.mastodon is defined and author.mastodon is not none and author["mastodon-server"] is defined and author["mastodon-server"] is not none %} 54 | {%- set has_ids = true %} 55 | · ![Mastodon icon](images/mastodon.svg){.inline_icon width=16 height=16} 56 | [\@{{author.mastodon}}@{{author["mastodon-server"]}}](https://{{author["mastodon-server"]}}/@{{author.mastodon}}) 57 | {%- endif %} 58 | {%- if has_ids %} 59 |
60 | {%- endif %} 61 | 62 | {%- if author.affiliations is defined and author.affiliations|length %} 63 | {{author.affiliations | join('; ')}} 64 | {%- endif %} 65 | {%- if author.funders is defined and author.funders|length %} 66 | · Funded by {{author.funders | join('; ')}} 67 | {%- endif %} 68 | 69 | {% endfor %} 70 | 71 | ::: {#correspondence} 72 | ✉ — Correspondence possible via {% if manubot.ci_source is defined -%}[GitHub Issues](https://github.com/{{manubot.ci_source.repo_slug}}/issues){% else %}GitHub Issues{% endif %} 73 | {% if manubot.authors|map(attribute='corresponding')|select|max -%} 74 | or email to 75 | {% for author in manubot.authors|selectattr("corresponding") -%} 76 | {{ author.name }} \<{{ author.email }}\>{{ ", " if not loop.last else "." }} 77 | {% endfor %} 78 | {% endif %} 79 | ::: 80 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full/content/04.05.01.crispr.md: -------------------------------------------------------------------------------- 1 | ### LVs link genes that alter lipid accumulation with relevant traits and tissues 2 | 3 | Our first experiment attempted to answer whether genes in a disease-relevant LV could represent potential therapeutic targets. 4 | For this, the first step was to obtain a set of genes strongly associated with a phenotype of interest. 5 | Therefore, we performed a fluorescence-based CRISPR-Cas9 in the HepG2 cell line and identified 462 genes associated with lipid regulation ([Methods](#sec:methods:crispr)). 6 | From these, we selected two high-confidence gene sets that either caused a decrease or increase of lipids: 7 | a lipids-decreasing gene-set with eight genes: *BLCAP*, *FBXW7*, *INSIG2*, *PCYT2*, *PTEN*, *SOX9*, *TCF7L2*, *UBE2J2*; 8 | and a lipids-increasing gene-set with six genes: *ACACA*, *DGAT2*, *HILPDA*, *MBTPS1*, *SCAP*, *SRPR* (Supplementary Data 2). 
9 | 10 | 11 | ![ 12 | **Tissues and traits associated with a gene module related to lipid metabolism (LV246).** 13 | 14 | **a)** Top cell types/tissues in which LV246's genes are expressed. 15 | Values in the $y$-axis come from matrix $\mathbf{B}$ in the MultiPLIER models (Figure {@fig:entire_process}b, see Methods). 16 | In the $x$-axis, cell types/tissues are sorted by the maximum sample value. 17 | 18 | **b)** Gene-trait associations (unadjusted $p$-values from S-MultiXcan [@doi:10.1371/journal.pgen.1007889]; threshold at -log($p$)=10) and colocalization probability (fastENLOC) for the top traits in LV246. 19 | The top 40 genes in LV246 are shown, sorted by their LV weight (matrix $\mathbf{Z}$), from largest (the top gene *SCD*) to smallest (*FAR2*); 20 | *DGAT2* and *ACACA*, in boldface, are two of the six high-confidence genes in the lipids-increasing gene set from the CRISPR screen. 21 | Cardiovascular-related traits are in boldface. 22 | 23 | SGBS: Simpson Golabi Behmel Syndrome; 24 | CH2DB: CH2 groups to double bonds ratio; 25 | HDL: high-density lipoprotein; 26 | RCP: locus regional colocalization probability. 27 | 28 | ](images/lvs_analysis/lv246/lv246.svg "LV246 TWAS plot"){#fig:lv246 width="100%"} 29 | 30 | 31 | Next, we analyzed all 987 LVs using Fast Gene Set Enrichment Analysis (FGSEA) [@doi:10.1101/060012], and found 15 LVs nominally enriched (unadjusted *P* < 0.01) with these lipid-altering gene-sets (Tables @tbl:sup:lipids_crispr:modules_enriched_increase and @tbl:sup:lipids_crispr:modules_enriched_decrease). 32 | Among those with reliable sample metadata, LV246, the top LV associated with the lipids-increasing gene-set, contained genes mainly co-expressed in adipose tissue (Figure {@fig:lv246}a), which plays a key role in coordinating and regulating lipid metabolism. 
33 | Using our regression framework across all traits in PhenomeXcan, we found that gene weights for this LV were predictive of gene associations for plasma lipids, high cholesterol, and Alzheimer's disease (Table @tbl:sup:phenomexcan_assocs:lv246, FDR < 1e-23). 34 | These lipid-related associations also replicated across the 309 traits in eMERGE (Table @tbl:sup:emerge_assocs:lv246), where LV246 was significantly associated with hypercholesterolemia (phecode: 272.11, FDR < 4e-9), hyperlipidemia (phecode: 272.1, FDR < 4e-7) and disorders of lipoid metabolism (phecode: 272, FDR < 4e-7). 35 | 36 | 37 | Two high-confidence genes from our CRISPR screen, *DGAT2* and *ACACA*, encode enzymes for triglyceride and fatty acid synthesis and were among the highest-weighted genes of LV246 (Figure {@fig:lv246}b, in boldface). 38 | However, in contrast to other members of LV246, *DGAT2* and *ACACA* were neither associated nor colocalized with any of the cardiovascular-related traits and thus would not have been prioritized by TWAS alone; 39 | instead, other members of LV246, such as *SCD*, *LPL*, *FADS2*, *HMGCR*, and *LDLR*, were significantly associated and colocalized with lipid-related traits. 40 | This lack of association of two high-confidence genes from our CRISPR screen might be explained from an omnigenic point of view [@doi:10.1016/j.cell.2019.04.014]. 41 | Assuming that the TWAS models for *DGAT2* and *ACACA* capture all common *cis*-eQTLs (the only genetic component of gene expression that TWAS can capture) and there are no rare *cis*-eQTLs, these two genes might represent "core" genes (i.e., they directly affect the trait with no mediated regulation of other genes), and many of the rest in the LV are "peripheral" genes that *trans*-regulate them. 
42 | 43 | -------------------------------------------------------------------------------- /tests/test_model_providers.py: -------------------------------------------------------------------------------- 1 | import os 2 | from unittest import mock 3 | from manubot_ai_editor import env_vars 4 | import pytest 5 | 6 | from manubot_ai_editor.model_providers import BaseModelProvider, MODEL_PROVIDERS 7 | 8 | 9 | @pytest.mark.parametrize( 10 | "provider", 11 | MODEL_PROVIDERS.values(), 12 | ) 13 | def test_model_provider_fields(provider: BaseModelProvider): 14 | """ 15 | Tests that each model provider has: 16 | - a default model engine 17 | - a non-empty API key environment variable (if applicable) 18 | - clients for the 'chat' and 'completions' endpoints 19 | - an endpoint for the default model engine 20 | - at least one model available 21 | """ 22 | 23 | # check that each provider has a default model engine 24 | default_engine = provider.default_model_engine() 25 | assert default_engine is not None and len(default_engine) > 0 26 | 27 | # check that for providers that have API keys, they're 28 | # set to a non-empty string 29 | api_key = provider.api_key_env_var() 30 | assert api_key is None or api_key.strip() != "" 31 | 32 | # test that each provider provides both the 'chat' and 'completions' 33 | # endpoints 34 | clients = provider.clients() 35 | assert set(clients.keys()) == {"chat", "completions"} 36 | 37 | # test that the endpoint for the default model engine is valid 38 | endpoint = provider.endpoint_for_model(default_engine) 39 | assert endpoint in clients 40 | 41 | # check that there's at least one model available from the provider 42 | assert len(provider.get_models()) > 0 43 | 44 | 45 | @pytest.mark.parametrize( 46 | "provider", 47 | MODEL_PROVIDERS.values(), 48 | ) 49 | def test_model_provider_specific_key_resolution(provider: BaseModelProvider): 50 | """ 51 | Tests that the model provider correctly resolves a provider-specific API key 52 | from the 
environment variables. If it's not required by the provider, checks 53 | that the key is set to None. 54 | """ 55 | 56 | api_key_var = provider.api_key_env_var() 57 | 58 | if api_key_var is None: 59 | # ensure that providers that don't require an API key 60 | # resolve None for the key 61 | assert provider.resolve_api_key() is None 62 | else: 63 | # if the provider does require a key, check that the provider-specific 64 | # key is resolved 65 | with mock.patch.dict("os.environ", {api_key_var: "1234"}): 66 | assert provider.resolve_api_key() == "1234" 67 | 68 | 69 | @pytest.mark.parametrize( 70 | "provider", 71 | MODEL_PROVIDERS.values(), 72 | ) 73 | @mock.patch.dict("os.environ", {env_vars.PROVIDER_API_KEY: "1234"}) 74 | def test_model_provider_generic_key_resolution(provider: BaseModelProvider): 75 | """ 76 | Tests that the model provider correctly resolves the generic API key from 77 | the environment variables in the absence of a provider-specific key. If it's 78 | not required by the provider, checks that the key is set to None. 79 | """ 80 | 81 | with mock.patch.dict("os.environ"): 82 | # remove the provider-specific key to make sure we're checking generic 83 | # key resolution 84 | if (key := provider.api_key_env_var()) is not None and key in os.environ: 85 | del os.environ[key] 86 | 87 | # check that the generic key is used 88 | if provider.api_key_env_var() is not None: 89 | assert provider.resolve_api_key() == "1234" 90 | else: 91 | assert provider.resolve_api_key() is None 92 | 93 | 94 | @pytest.mark.parametrize( 95 | "provider", 96 | MODEL_PROVIDERS.values(), 97 | ) 98 | @mock.patch.dict("os.environ", {env_vars.PROVIDER_API_KEY: "1234"}) 99 | def test_model_provider_get_models(provider: BaseModelProvider): 100 | """ 101 | Tests that the model provider can correctly retrieve the list of models 102 | from the cache, and that the default language model for each provider 103 | is in that list. 
104 | """ 105 | 106 | with mock.patch.dict("os.environ"): 107 | # remove the provider-specific key to ensure that a valid key doesn't 108 | # interfere with checking the cache 109 | if (key := provider.api_key_env_var()) is not None and key in os.environ: 110 | del os.environ[key] 111 | 112 | # check that we can find the default model in each provider's list of 113 | # models 114 | default_model = provider.default_model_engine() 115 | assert default_model is not None 116 | assert default_model in provider.get_models() 117 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc/08.01.methods.ccc.md: -------------------------------------------------------------------------------- 1 | ## Methods 2 | 3 | The code needed to reproduce all of our analyses and generate the figures is available in [https://github.com/greenelab/ccc](https://github.com/greenelab/ccc). 4 | We provide scripts to download the required data and run all the steps. 5 | A Docker image is provided to use the same runtime environment. 6 | 7 | 8 | ### The CCC algorithm {#sec:ccc_algo .page_break_before} 9 | 10 | The Clustermatch Correlation Coefficient (CCC) computes a similarity value $c \in \left[0,1\right]$ between any pair of numerical or categorical features/variables $\mathbf{x}$ and $\mathbf{y}$ measured on $n$ objects. 11 | CCC assumes that if two features $\mathbf{x}$ and $\mathbf{y}$ are similar, then the partitioning by clustering of the $n$ objects using each feature separately should match. 12 | For example, given $\mathbf{x}=(11, 27, 32, 40)$ and $\mathbf{y}=10x=(110, 270, 320, 400)$, where $n=4$, partitioning each variable into two clusters ($k=2$) using their medians (29.5 for $\mathbf{x}$ and 295 for $\mathbf{y}$) would result in partition $\Omega^{\mathbf{x}}_{k=2}=(1, 1, 2, 2)$ for $\mathbf{x}$, and partition $\Omega^{\mathbf{y}}_{k=2}=(1, 1, 2, 2)$ for $\mathbf{y}$. 
13 | Then, the agreement between $\Omega^{\mathbf{x}}_{k=2}$ and $\Omega^{\mathbf{y}}_{k=2}$ can be computed using any measure of similarity between partitions, like the adjusted Rand index (ARI) [@doi:10.1007/BF01908075]. 14 | In that case, it will return the maximum value (1.0 in the case of ARI). 15 | Note that the same value of $k$ might not be the right one to find a relationship between any two features. 16 | For instance, in the quadratic example in Figure @fig:datasets_rel, CCC returns a value of 0.36 (grouping objects in four clusters using one feature and two using the other). 17 | If we used only two clusters instead, CCC would return a similarity value of 0.02. 18 | Therefore, the CCC algorithm (shown below) searches for this optimal number of clusters given a maximum $k$, which is its single parameter $k_{\mathrm{max}}$. 19 | 20 | ![ 21 | ](images/intro/ccc_algorithm/ccc_algorithm.svg "CCC algorithm"){width="75%"} 22 | 23 | The main function of the algorithm, `ccc`, generates a list of partitionings $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (lines 14 and 15), for each feature $\mathbf{x}$ and $\mathbf{y}$. 24 | Then, it computes the ARI between each partition in $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (line 16), and then it keeps the pair that generates the maximum ARI. 25 | Finally, since ARI does not have a lower bound (it could return negative values, which in our case are not meaningful), CCC returns only values between 0 and 1 (line 17). 26 | 27 | 28 | Interestingly, since CCC only needs a pair of partitions to compute a similarity value, any type of feature that can be used to perform clustering/grouping is supported. 29 | If the feature is numerical (lines 2 to 5 in the `get_partitions` function), then quantiles are used for clustering (for example, the median generates $k=2$ clusters of objects), from $k=2$ to $k=k_{\mathrm{max}}$. 
30 | If the feature is categorical (lines 7 to 9), the categories are used to group objects together. 31 | Consequently, because features are internally categorized into clusters, numerical and categorical variables can be integrated naturally, as clusters do not need an order. 32 | 33 | 34 | For all our analyses we used $k_{\mathrm{max}}=10$. 35 | This means that for each gene pair, 18 partitions are generated (9 for each gene, from $k=2$ to $k=10$), and 81 ARI comparisons are performed. 36 | Smaller values of $k_{\mathrm{max}}$ can reduce computation time, although at the expense of missing more complex/general relationships. 37 | Our examples in Figure @fig:datasets_rel suggest that using $k_{\mathrm{max}}=2$ would force CCC to find linear-only patterns, which could be a valid use case when only this kind of relationship is desired. 38 | In addition, $k_{\mathrm{max}}=2$ implies that only two partitions are generated, and only one ARI comparison is performed. 39 | In this regard, our Python implementation of CCC provides flexibility in specifying $k_{\mathrm{max}}$. 40 | For instance, instead of the maximum $k$ (an integer), the parameter could be a custom list of integers: for example, `[2, 5, 10]` will partition the data into two, five and ten clusters. 41 | 42 | 43 | For a single pair of features (genes in our study), generating partitions or computing their similarity can be parallelized. 44 | We used three CPU cores in our analyses to speed up the computation of CCC. 45 | A future improved implementation of CCC could potentially use graphics processing units (GPUs) to parallelize its computation further. 46 | 47 | 48 | A Python implementation of CCC (optimized with `numba` [@doi:10.1145/2833157.2833162]) is available in our GitHub repository [@url:https://github.com/greenelab/clustermatch-gene-expr] and as a package on the Python Package Index (PyPI) that can be easily installed. 
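The procedure described in this section can be sketched in a few lines of pure Python. This is a minimal illustration for numerical features only, not the `numba`-optimized implementation from the repository. The function names (`quantile_partition`, `adjusted_rand_index`, `ccc`), the rank-based handling of ties, and the cap of $k$ at $n-1$ are assumptions made for this example.

```python
from collections import Counter
from itertools import product
from math import comb


def quantile_partition(values, k):
    """Split objects into k clusters of roughly equal size by rank.

    Assumption: ties are broken by position, which is a simplification
    of true quantile-based clustering.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // n
    return labels


def adjusted_rand_index(a, b):
    """Adjusted Rand index (Hubert and Arabie) between two partitions."""
    total = comb(len(a), 2)
    sum_ab = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (both partitions trivial): treat as perfect agreement.
        return 1.0
    return (sum_ab - expected) / (max_index - expected)


def ccc(x, y, k_max=10):
    """Sketch of CCC: maximum ARI over all partition pairs, floored at 0."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 3:
        raise ValueError("need at least three objects")
    # Assumption: cap k at n - 1 so no partition consists only of singletons.
    ks = range(2, min(k_max, n - 1) + 1)
    parts_x = [quantile_partition(x, k) for k in ks]
    parts_y = [quantile_partition(y, k) for k in ks]
    best = max(adjusted_rand_index(px, py) for px, py in product(parts_x, parts_y))
    # ARI can be negative; CCC only reports values in [0, 1].
    return max(best, 0.0)
```

For the example given earlier in this section, `ccc([11, 27, 32, 40], [110, 270, 320, 400])` returns 1.0 under these assumptions, because the two $k=2$ partitions are identical.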
49 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc_non_standard_filenames/08.01.meths.md: -------------------------------------------------------------------------------- 1 | ## Methods 2 | 3 | The code needed to reproduce all of our analyses and generate the figures is available in [https://github.com/greenelab/ccc](https://github.com/greenelab/ccc). 4 | We provide scripts to download the required data and run all the steps. 5 | A Docker image is provided to use the same runtime environment. 6 | 7 | 8 | ### The CCC algorithm {#sec:ccc_algo .page_break_before} 9 | 10 | The Clustermatch Correlation Coefficient (CCC) computes a similarity value $c \in \left[0,1\right]$ between any pair of numerical or categorical features/variables $\mathbf{x}$ and $\mathbf{y}$ measured on $n$ objects. 11 | CCC assumes that if two features $\mathbf{x}$ and $\mathbf{y}$ are similar, then the partitioning by clustering of the $n$ objects using each feature separately should match. 12 | For example, given $\mathbf{x}=(11, 27, 32, 40)$ and $\mathbf{y}=10x=(110, 270, 320, 400)$, where $n=4$, partitioning each variable into two clusters ($k=2$) using their medians (29.5 for $\mathbf{x}$ and 295 for $\mathbf{y}$) would result in partition $\Omega^{\mathbf{x}}_{k=2}=(1, 1, 2, 2)$ for $\mathbf{x}$, and partition $\Omega^{\mathbf{y}}_{k=2}=(1, 1, 2, 2)$ for $\mathbf{y}$. 13 | Then, the agreement between $\Omega^{\mathbf{x}}_{k=2}$ and $\Omega^{\mathbf{y}}_{k=2}$ can be computed using any measure of similarity between partitions, like the adjusted Rand index (ARI) [@doi:10.1007/BF01908075]. 14 | In that case, it will return the maximum value (1.0 in the case of ARI). 15 | Note that the same value of $k$ might not be the right one to find a relationship between any two features. 
16 | For instance, in the quadratic example in Figure @fig:datasets_rel, CCC returns a value of 0.36 (grouping objects in four clusters using one feature and two using the other). 17 | If we used only two clusters instead, CCC would return a similarity value of 0.02. 18 | Therefore, the CCC algorithm (shown below) searches for this optimal number of clusters given a maximum $k$, which is its single parameter $k_{\mathrm{max}}$. 19 | 20 | ![ 21 | ](images/intro/ccc_algorithm/ccc_algorithm.svg "CCC algorithm"){width="75%"} 22 | 23 | The main function of the algorithm, `ccc`, generates a list of partitionings $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (lines 14 and 15), for each feature $\mathbf{x}$ and $\mathbf{y}$. 24 | Then, it computes the ARI between each partition in $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (line 16), and then it keeps the pair that generates the maximum ARI. 25 | Finally, since ARI does not have a lower bound (it could return negative values, which in our case are not meaningful), CCC returns only values between 0 and 1 (line 17). 26 | 27 | 28 | Interestingly, since CCC only needs a pair of partitions to compute a similarity value, any type of feature that can be used to perform clustering/grouping is supported. 29 | If the feature is numerical (lines 2 to 5 in the `get_partitions` function), then quantiles are used for clustering (for example, the median generates $k=2$ clusters of objects), from $k=2$ to $k=k_{\mathrm{max}}$. 30 | If the feature is categorical (lines 7 to 9), the categories are used to group objects together. 31 | Consequently, since features are internally categorized into clusters, numerical and categorical variables can be naturally integrated since clusters do not need an order. 32 | 33 | 34 | For all our analyses we used $k_{\mathrm{max}}=10$. 35 | This means that for each gene pair, 18 partitions are generated (9 for each gene, from $k=2$ to $k=10$), and 81 ARI comparisons are performed. 
36 | Smaller values of $k_{\mathrm{max}}$ can reduce computation time, although at the expense of missing more complex/general relationships. 37 | Our examples in Figure @fig:datasets_rel suggest that using $k_{\mathrm{max}}=2$ would force CCC to find linear-only patterns, which could be a valid use case when only this kind of relationship is desired. 38 | In addition, $k_{\mathrm{max}}=2$ implies that only two partitions are generated, and only one ARI comparison is performed. 39 | In this regard, our Python implementation of CCC provides flexibility in specifying $k_{\mathrm{max}}$. 40 | For instance, instead of the maximum $k$ (an integer), the parameter could be a custom list of integers: for example, `[2, 5, 10]` will partition the data into two, five and ten clusters. 41 | 42 | 43 | For a single pair of features (genes in our study), generating partitions or computing their similarity can be parallelized. 44 | We used three CPU cores in our analyses to speed up the computation of CCC. 45 | A future improved implementation of CCC could potentially use graphics processing units (GPUs) to parallelize its computation further. 46 | 47 | 48 | A Python implementation of CCC (optimized with `numba` [@doi:10.1145/2833157.2833162]) is available in our GitHub repository [@url:https://github.com/greenelab/clustermatch-gene-expr] and as a package on the Python Package Index (PyPI) that can be easily installed. 49 | -------------------------------------------------------------------------------- /libs/manubot_ai_editor/env_vars.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file contains environment variable names used by the manubot-ai-editor 3 | package. They allow specifying different parameters when calling the 4 | OpenAI model, such as the language model or the maximum tokens per request 5 | (see more details in https://beta.openai.com/docs/api-reference/completions/create). 
6 | 7 | If you are using our GitHub Actions workflow provided by manubot/rootstock, you need 8 | to modify the "Revise manuscript" step in the workflow file (.github/workflows/ai-revision.yaml) 9 | by adding the environment variable name specified in the _value_ of the variables. For instance, 10 | if you want to provide a custom prompt, then you need to add a line like this to the workflow: 11 | 12 | AI_EDITOR_CUSTOM_PROMPT="proofread the following paragraph" 13 | """ 14 | 15 | # generic provider API key to use when a provider-specific API key is not 16 | # provided 17 | PROVIDER_API_KEY = "PROVIDER_API_KEY" 18 | 19 | # provider-specific API key overrides 20 | OPENAI_API_KEY = "OPENAI_API_KEY" 21 | ANTHROPIC_API_KEY = "ANTHROPIC_API_KEY" 22 | 23 | # model provider to use, e.g. "openai" or "anthropic" 24 | MODEL_PROVIDER = "AI_EDITOR_MODEL_PROVIDER" 25 | 26 | # Language model to use. For example, "text-davinci-003", "gpt-3.5-turbo", "gpt-3.5-turbo-0301", etc 27 | # The tool currently supports the "chat/completions" and "completions" endpoints, and you can check 28 | # compatible openai models here: https://platform.openai.com/docs/models/model-endpoint-compatibility 29 | # anthropic models are here: https://docs.anthropic.com/en/docs/about-claude/models 30 | LANGUAGE_MODEL = "AI_EDITOR_LANGUAGE_MODEL" 31 | 32 | # Model parameter: max_tokens 33 | MAX_TOKENS_PER_REQUEST = "AI_EDITOR_MAX_TOKENS_PER_REQUEST" 34 | 35 | # Model parameter: temperature 36 | TEMPERATURE = "AI_EDITOR_TEMPERATURE" 37 | 38 | # Model parameter: top_p 39 | TOP_P = "AI_EDITOR_TOP_P" 40 | 41 | # Model parameter: presence_penalty 42 | PRESENCE_PENALTY = "AI_EDITOR_PRESENCE_PENALTY" 43 | 44 | # Model parameter: frequency_penalty 45 | FREQUENCY_PENALTY = "AI_EDITOR_FREQUENCY_PENALTY" 46 | 47 | # Model parameter: best_of 48 | BEST_OF = "AI_EDITOR_BEST_OF" 49 | 50 | # Allows specifying a JSON string, where keys are filenames and values are 51 | # section names. 
For example: '{"01.intro.md": "introduction"}' 52 | # Possible values for section names are: "abstract", "introduction", 53 | # "results", "discussion", "conclusions", "methods", and "supplementary material". 54 | # Take a look at function 'get_prompt' in 'libs/manubot_ai_editor/models.py' 55 | # to see which prompts are used for each section. 56 | # Although the AI Editor tries to infer the section name from the filename, 57 | # sometimes filenames are not descriptive enough (e.g., "01.intro.md" or 58 | # "02.review.md" might indicate an introduction). 59 | # Mapping filenames to section names is useful to provide more context to the 60 | # AI model when revising a paragraph. For example, for the introduction, prompts 61 | # contain sentences to preserve most of the citations to other papers. 62 | SECTIONS_MAPPING = "AI_EDITOR_FILENAME_SECTION_MAPPING" 63 | 64 | # Sometimes the AI model returns an empty paragraph. Usually, this is resolved 65 | # by running the model again. By default, the AI Editor will try five (5) times in these 66 | # cases. This variable allows specifying the number of retries. 67 | RETRY_COUNT = "AI_EDITOR_RETRY_COUNT" 68 | 69 | # If specified, only these file names will be revised. Multiple files can be 70 | # specified, separated by commas. For example: "01.intro.md,02.review.md" 71 | FILENAMES_TO_REVISE = "AI_EDITOR_FILENAMES_TO_REVISE" 72 | 73 | # It allows specifying a single, custom prompt for all sections. For example: 74 | # "proofread and revise the following paragraph"; in this case, the tool will automatically 75 | # append the characters ':\n\n' followed by the paragraph. 76 | # It is also possible to include placeholders in the prompt, which will be replaced 77 | # by the corresponding values. For example, "proofread and revise the following 78 | # paragraph from the section {section_name} of a scientific manuscript with title '{title}'".
79 | # The complete list of placeholders is: {paragraph_text}, {section_name}, 80 | # {title}, {keywords}. 81 | CUSTOM_PROMPT = "AI_EDITOR_CUSTOM_PROMPT" 82 | 83 | # Specifies the source and destination encodings of input and output markdown 84 | # files. Behavior is as follows: 85 | # - If neither SRC_ENCODING nor DEST_ENCODING are specified, the tool will 86 | # attempt to identify the encoding using the charset_normalizer library and 87 | # use that encoding to both read and write the output files. 88 | # - If only SRC_ENCODING is specified, it will be used to both read and write 89 | # the files. 90 | # - If only DEST_ENCODING is specified, it will be used to write the output 91 | # files, and the input files will be read using the encoding identified by 92 | # charset_normalizer. 93 | SRC_ENCODING = "AI_EDITOR_SRC_ENCODING" 94 | DEST_ENCODING = "AI_EDITOR_DEST_ENCODING" 95 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc/04.05.results_intro.md: -------------------------------------------------------------------------------- 1 | ### A robust and efficient not-only-linear dependence coefficient 2 | 3 | ![ 4 | **Different types of relationships in data.** 5 | Each panel contains a set of simulated data points described by two generic variables: $x$ and $y$. 6 | The first row shows Anscombe's quartet with four different datasets (from Anscombe I to IV) and 11 data points each. 7 | The second row contains a set of general patterns with 100 data points each. 8 | Each panel shows the correlation value using Pearson ($p$), Spearman ($s$) and CCC ($c$). 9 | Vertical and horizontal red lines show how CCC clustered data points using $x$ and $y$. 10 | ](images/intro/relationships.svg "Different types of relationships in data"){#fig:datasets_rel width="100%"} 11 | 12 | The CCC provides a similarity measure between any pair of variables, either with numerical or categorical values. 
13 | The method assumes that if there is a relationship between two variables/features describing $n$ data points/objects, then the **cluster**ings of those objects using each variable should **match**. 14 | In the case of numerical values, CCC uses quantiles to efficiently separate data points into different clusters (e.g., the median separates numerical data into two clusters). 15 | Once all clusterings are generated according to each variable, we define the CCC as the maximum adjusted Rand index (ARI) [@doi:10.1007/BF01908075] between them, ranging between 0 and 1. 16 | Details of the CCC algorithm can be found in [Methods](#sec:ccc_algo). 17 | 18 | 19 | We examined how the Pearson ($p$), Spearman ($s$) and CCC ($c$) correlation coefficients behaved on different simulated data patterns. 20 | In the first row of Figure @fig:datasets_rel, we examine the classic Anscombe's quartet [@doi:10.1080/00031305.1973.10478966], which comprises four synthetic datasets with different patterns but the same data statistics (mean, standard deviation and Pearson's correlation). 21 | This kind of simulated data, recently revisited with the "Datasaurus" [@url:http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html; @doi:10.1145/3025453.3025912; @doi:10.1111/dsji.12233], is used as a reminder of the importance of going beyond simple statistics, where either undesirable patterns (such as outliers) or desirable ones (such as biologically meaningful nonlinear relationships) can be masked by summary statistics alone. 22 | 23 | 24 | Anscombe I contains a noisy but clear linear pattern, similar to Anscombe III where the linearity is perfect besides one outlier. 25 | In these two examples, CCC separates data points using two clusters (one red line for each variable $x$ and $y$), yielding 1.0 and thus indicating a strong relationship. 26 | Anscombe II seems to follow a partially quadratic relationship interpreted as linear by Pearson and Spearman. 
27 | In contrast, for this potentially undersampled quadratic pattern, CCC yields a lower yet non-zero value of 0.34, reflecting a more complex relationship than a linear pattern. 28 | Anscombe IV shows a vertical line of data points where $x$ values are almost constant except for one outlier. 29 | This outlier does not influence CCC as it does for Pearson or Spearman. 30 | Thus $c=0.00$ (the minimum value) correctly indicates no association for this variable pair because, besides the outlier, for a single value of $x$ there are ten different values for $y$. 31 | This pair of variables does not fit the CCC assumption: the two clusters formed with $x$ (approximately separated by $x=13$) do not match the three clusters formed with $y$. 32 | The Pearson's correlation coefficient is the same across all these Anscombe's examples ($p=0.82$), whereas Spearman is 0.50 or greater. 33 | These simulated datasets show that both Pearson and Spearman are powerful in detecting linear patterns. 34 | However, any deviation in this assumption (like nonlinear relationships or outliers) affects their robustness. 35 | 36 | 37 | We simulated additional types of relationships (Figure @fig:datasets_rel, second row), including some previously described from gene expression data [@doi:10.1126/science.1205438; @doi:10.3389/fgene.2019.01410; @doi:10.1091/mbc.9.12.3273]. 38 | For the random/independent pair of variables, all coefficients correctly agree with a value close to zero. 39 | The non-coexistence pattern, captured by all coefficients, represents a case where one gene ($x$) might be expressed while the other one ($y$) is inhibited, highlighting a potentially strong biological relationship (such as a microRNA negatively regulating another gene). 40 | For the other two examples (quadratic and two-lines), Pearson and Spearman do not capture the nonlinear pattern between variables $x$ and $y$. 
41 | These patterns also show how CCC uses different degrees of complexity to capture the relationships. 42 | For the quadratic pattern, for example, CCC separates $x$ into more clusters (four in this case) to reach the maximum ARI. 43 | The two-lines example shows two embedded linear relationships with different slopes, which neither Pearson nor Spearman detect ($p=-0.12$ and $s=0.05$, respectively). 44 | Here, CCC increases the complexity of the model by using eight clusters for $x$ and six for $y$, resulting in $c=0.31$. 45 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc_non_standard_filenames/04.05.res.md: -------------------------------------------------------------------------------- 1 | ### A robust and efficient not-only-linear dependence coefficient 2 | 3 | ![ 4 | **Different types of relationships in data.** 5 | Each panel contains a set of simulated data points described by two generic variables: $x$ and $y$. 6 | The first row shows Anscombe's quartet with four different datasets (from Anscombe I to IV) and 11 data points each. 7 | The second row contains a set of general patterns with 100 data points each. 8 | Each panel shows the correlation value using Pearson ($p$), Spearman ($s$) and CCC ($c$). 9 | Vertical and horizontal red lines show how CCC clustered data points using $x$ and $y$. 10 | ](images/intro/relationships.svg "Different types of relationships in data"){#fig:datasets_rel width="100%"} 11 | 12 | The CCC provides a similarity measure between any pair of variables, either with numerical or categorical values. 13 | The method assumes that if there is a relationship between two variables/features describing $n$ data points/objects, then the **cluster**ings of those objects using each variable should **match**. 
14 | In the case of numerical values, CCC uses quantiles to efficiently separate data points into different clusters (e.g., the median separates numerical data into two clusters). 15 | Once all clusterings are generated according to each variable, we define the CCC as the maximum adjusted Rand index (ARI) [@doi:10.1007/BF01908075] between them, ranging between 0 and 1. 16 | Details of the CCC algorithm can be found in [Methods](#sec:ccc_algo). 17 | 18 | 19 | We examined how the Pearson ($p$), Spearman ($s$) and CCC ($c$) correlation coefficients behaved on different simulated data patterns. 20 | In the first row of Figure @fig:datasets_rel, we examine the classic Anscombe's quartet [@doi:10.1080/00031305.1973.10478966], which comprises four synthetic datasets with different patterns but the same data statistics (mean, standard deviation and Pearson's correlation). 21 | This kind of simulated data, recently revisited with the "Datasaurus" [@url:http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html; @doi:10.1145/3025453.3025912; @doi:10.1111/dsji.12233], is used as a reminder of the importance of going beyond simple statistics, where either undesirable patterns (such as outliers) or desirable ones (such as biologically meaningful nonlinear relationships) can be masked by summary statistics alone. 22 | 23 | 24 | Anscombe I contains a noisy but clear linear pattern, similar to Anscombe III where the linearity is perfect besides one outlier. 25 | In these two examples, CCC separates data points using two clusters (one red line for each variable $x$ and $y$), yielding 1.0 and thus indicating a strong relationship. 26 | Anscombe II seems to follow a partially quadratic relationship interpreted as linear by Pearson and Spearman. 27 | In contrast, for this potentially undersampled quadratic pattern, CCC yields a lower yet non-zero value of 0.34, reflecting a more complex relationship than a linear pattern. 
28 | Anscombe IV shows a vertical line of data points where $x$ values are almost constant except for one outlier. 29 | This outlier does not influence CCC as it does for Pearson or Spearman. 30 | Thus $c=0.00$ (the minimum value) correctly indicates no association for this variable pair because, besides the outlier, for a single value of $x$ there are ten different values for $y$. 31 | This pair of variables does not fit the CCC assumption: the two clusters formed with $x$ (approximately separated by $x=13$) do not match the three clusters formed with $y$. 32 | The Pearson's correlation coefficient is the same across all these Anscombe's examples ($p=0.82$), whereas Spearman is 0.50 or greater. 33 | These simulated datasets show that both Pearson and Spearman are powerful in detecting linear patterns. 34 | However, any deviation in this assumption (like nonlinear relationships or outliers) affects their robustness. 35 | 36 | 37 | We simulated additional types of relationships (Figure @fig:datasets_rel, second row), including some previously described from gene expression data [@doi:10.1126/science.1205438; @doi:10.3389/fgene.2019.01410; @doi:10.1091/mbc.9.12.3273]. 38 | For the random/independent pair of variables, all coefficients correctly agree with a value close to zero. 39 | The non-coexistence pattern, captured by all coefficients, represents a case where one gene ($x$) might be expressed while the other one ($y$) is inhibited, highlighting a potentially strong biological relationship (such as a microRNA negatively regulating another gene). 40 | For the other two examples (quadratic and two-lines), Pearson and Spearman do not capture the nonlinear pattern between variables $x$ and $y$. 41 | These patterns also show how CCC uses different degrees of complexity to capture the relationships. 42 | For the quadratic pattern, for example, CCC separates $x$ into more clusters (four in this case) to reach the maximum ARI. 
43 | The two-lines example shows two embedded linear relationships with different slopes, which neither Pearson nor Spearman detect ($p=-0.12$ and $s=0.05$, respectively). 44 | Here, CCC increases the complexity of the model by using eight clusters for $x$ and six for $y$, resulting in $c=0.31$. 45 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier/metadata.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Projecting genetic associations through gene expression patterns highlights disease etiology and drug mechanisms" 3 | keywords: 4 | - genetic studies 5 | - functional genomics 6 | - gene co-expression 7 | - therapeutic targets 8 | - drug repurposing 9 | - clustering of complex traits 10 | lang: en-US 11 | authors: 12 | - name: Milton Pividori 13 | github: miltondp 14 | initials: MP 15 | orcid: 0000-0002-3035-4403 16 | twitter: miltondp 17 | email: miltondp@gmail.com 18 | affiliations: 19 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 20 | funders: 21 | - The Gordon and Betty Moore Foundation GBMF 4552 22 | - The National Human Genome Research Institute (R01 HG010067) 23 | - The National Human Genome Research Institute (K99HG011898) 24 | 25 | - name: Sumei Lu 26 | affiliations: 27 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 28 | 29 | - name: Binglan Li 30 | orcid: 0000-0002-0103-6107 31 | affiliations: 32 | - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA 33 | 34 | - name: Chun Su 35 | orcid: 0000-0001-6388-8666 36 | github: sckinta 37 | affiliations: 38 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 39 | 40 | - name: Matthew E. 
Johnson 41 | affiliations: 42 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 43 | 44 | - name: Wei-Qi Wei 45 | affiliations: 46 | - Vanderbilt University Medical Center 47 | 48 | - name: Qiping Feng 49 | orcid: 0000-0002-6213-793X 50 | affiliations: 51 | - Vanderbilt University Medical Center 52 | 53 | - name: Bahram Namjou 54 | affiliations: 55 | - Cincinnati Children's Hospital Medical Center 56 | 57 | - name: Krzysztof Kiryluk 58 | orcid: 0000-0002-5047-6715 59 | twitter: kirylukk 60 | affiliations: 61 | - Department of Medicine, Division of Nephrology, Vagelos College of Physicians & Surgeons, Columbia University, New York, New York 62 | 63 | - name: Iftikhar Kullo 64 | affiliations: 65 | - Mayo Clinic 66 | 67 | - name: Yuan Luo 68 | affiliations: 69 | - Northwestern University 70 | 71 | - name: Blair D. Sullivan 72 | affiliations: 73 | - School of Computing, University of Utah, Salt Lake City, UT, USA 74 | 75 | - name: Benjamin F. Voight 76 | orcid: 0000-0002-6205-9994 77 | twitter: bvoight28 78 | github: bvoight 79 | affiliations: 80 | - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 81 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 82 | - Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 83 | 84 | - name: Carsten Skarke 85 | orcid: 0000-0001-5145-3681 86 | twitter: CarstenSkarke 87 | affiliations: 88 | - Institute for Translational Medicine and Therapeutics, Department of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 89 | 90 | - name: Marylyn D. 
Ritchie 91 | initials: MDR 92 | orcid: 0000-0002-1208-1720 93 | twitter: MarylynRitchie 94 | email: marylyn@pennmedicine.upenn.edu 95 | affiliations: 96 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 97 | 98 | - name: Struan F.A. Grant 99 | email: grants@chop.edu 100 | orcid: 0000-0003-2025-5302 101 | twitter: STRUANGRANT 102 | affiliations: 103 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 104 | - Division of Endocrinology and Diabetes, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 105 | - Division of Human Genetics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 106 | - Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA 107 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA 108 | 109 | - name: Casey S. Greene 110 | github: cgreene 111 | initials: CSG 112 | orcid: 0000-0001-8713-9213 113 | twitter: GreeneScientist 114 | email: casey.s.greene@cuanschutz.edu 115 | affiliations: 116 | - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA 117 | - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA 118 | funders: 119 | - The Gordon and Betty Moore Foundation (GBMF 4552) 120 | - The National Human Genome Research Institute (R01 HG010067) 121 | - The National Cancer Institute (R01 CA237170) 122 | corresponding: true 123 | -------------------------------------------------------------------------------- /docs/env-vars.md: -------------------------------------------------------------------------------- 1 | # Manubot AI Editor Environment Variables 2 | 3 | Manubot AI Editor provides a variety of options to customize the revision 4 | process. 
These options are exposed as environment variables, most of which are 5 | prefixed with `AI_EDITOR_` (the API key variables described below are the exception). 6 | 7 | The following environment variables are supported, organized into categories: 8 | 9 | ## Provider Configuration 10 | 11 | This tool refers to services that provide LLMs, such as OpenAI or Anthropic, as 12 | "providers". 13 | 14 | - `AI_EDITOR_MODEL_PROVIDER`: Specifies the provider; currently, the values we 15 | support are "openai" for OpenAI and "anthropic" for Anthropic. 16 | 17 | ## Provider API Key Configuration 18 | 19 | For providers that require API keys, you can specify an API key specific to that provider via an environment variable named `<PROVIDER>_API_KEY`. 20 | For example, for OpenAI, the API key variable would be named `OPENAI_API_KEY` and for Anthropic, it would be `ANTHROPIC_API_KEY`. 21 | 22 | Alternatively, you can use the environment variable `PROVIDER_API_KEY` to set an API key that will be used for all providers. 23 | If both a provider-specific key and `PROVIDER_API_KEY` are set, the provider-specific key will take precedence. 24 | 25 | ## Model Configuration 26 | 27 | - `AI_EDITOR_LANGUAGE_MODEL`: Language model to use. For example, 28 | "text-davinci-003", "gpt-3.5-turbo", "gpt-3.5-turbo-0301", etc.
The tool 29 | currently supports the "chat/completions", "completions", and "edits" endpoints, 30 | and you can check compatible OpenAI models here: 31 | https://platform.openai.com/docs/models/model-endpoint-compatibility 32 | Anthropic models are listed here: 33 | https://docs.anthropic.com/en/docs/about-claude/models/all-models 34 | - `AI_EDITOR_MAX_TOKENS_PER_REQUEST`: Model parameter: `max_tokens` 35 | - `AI_EDITOR_TEMPERATURE`: Model parameter: `temperature` 36 | - `AI_EDITOR_TOP_P`: Model parameter: `top_p` 37 | - `AI_EDITOR_PRESENCE_PENALTY`: Model parameter: `presence_penalty` 38 | - `AI_EDITOR_FREQUENCY_PENALTY`: Model parameter: `frequency_penalty` 39 | - `AI_EDITOR_BEST_OF`: Model parameter: `best_of` 40 | 41 | ## Prompt and Query Control 42 | 43 | - `AI_EDITOR_FILENAME_SECTION_MAPPING`: Allows the user to specify a JSON 44 | string, where keys are filenames and values are section names. For example: 45 | `{"01.intro.md": "introduction"}`. Possible values for section names are: 46 | "abstract", "introduction", "results", "discussion", "conclusions", "methods", 47 | and "supplementary material". Take a look at function `get_prompt()` in 48 | [libs/manubot_ai_editor/models.py](https://github.com/manubot/manubot-ai-editor/blob/main/libs/manubot_ai_editor/models.py#L256) 49 | to see which prompts are used for each section. Although the AI Editor tries to 50 | infer the section name from the filename, sometimes filenames are not 51 | descriptive enough (e.g., "01.intro.md" or "02.review.md" might indicate an 52 | introduction). Mapping filenames to section names is useful to provide more 53 | context to the AI model when revising a paragraph. For example, for the 54 | introduction, prompts contain sentences to preserve most of the citations to 55 | other papers. 56 | - `AI_EDITOR_RETRY_COUNT`: Sometimes the AI model returns an empty paragraph. 57 | Usually, this is resolved by running the model again. The AI Editor will try 58 | five times in these cases.
This variable overrides the number of 59 | retries from its default of 5. 60 | - `AI_EDITOR_FILENAMES_TO_REVISE`: If specified, only these file names will be 61 | revised. Multiple files can be specified, separated by commas. For example: 62 | "01.intro.md,02.review.md" 63 | - `AI_EDITOR_CUSTOM_PROMPT`: Allows the user to specify a single, custom prompt 64 | for all sections. For example: "proofread and revise the following paragraph"; 65 | in this case, the tool will automatically append the characters ':\n\n' followed 66 | by the paragraph. It is also possible to include placeholders in the prompt, 67 | which will be replaced by the corresponding values. For example, "proofread and 68 | revise the following paragraph from the section {section_name} of a scientific 69 | manuscript with title '{title}'". The complete list of placeholders is: 70 | `{paragraph_text}`, `{section_name}`, `{title}`, `{keywords}`. 71 | 72 | ## Encodings 73 | 74 | These variables specify the source and destination encodings of input and output markdown 75 | files. Behavior is as follows: 76 | - If neither `SRC_ENCODING` nor `DEST_ENCODING` is specified, both the input 77 | and output encodings will default to `utf-8`. 78 | - If only `SRC_ENCODING` is specified, it will be used to both read and write 79 | the files. If the special value `_auto_` is used, the tool will attempt to 80 | identify the encoding using the 81 | [charset_normalizer](https://github.com/jawah/charset_normalizer) library, 82 | then use that encoding to both read the input files and write the output 83 | files. 84 | - If only `DEST_ENCODING` is specified, it will be used to write the output 85 | files; the input encoding will be assumed to be `utf-8`. 86 | 87 | The variables: 88 | 89 | - `AI_EDITOR_SRC_ENCODING`: the encoding of the input markdown files 90 | - if empty, defaults to `utf-8`, and 91 | - if `_auto_`, the input encoding is auto-detected.
92 | - `AI_EDITOR_DEST_ENCODING`: the encoding to use when writing the output markdown 93 | files 94 | - if empty, defaults to whatever was used for the source encoding. 95 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc/02.introduction.md: -------------------------------------------------------------------------------- 1 | ## Introduction 2 | 3 | New technologies have vastly improved data collection, generating a deluge of information across different disciplines. 4 | This large amount of data provides new opportunities to address unanswered scientific questions, provided we have efficient tools capable of identifying multiple types of underlying patterns. 5 | Correlation analysis is an essential statistical technique for discovering relationships between variables [@pmid:21310971]. 6 | Correlation coefficients are often used in exploratory data mining techniques, such as clustering or community detection algorithms, to compute a similarity value between a pair of objects of interest such as genes [@pmid:27479844] or disease-relevant lifestyle factors [@doi:10.1073/pnas.1217269109]. 7 | Correlation methods are also used in supervised tasks, for example, for feature selection to improve prediction accuracy [@pmid:27006077; @pmid:33729976]. 8 | The Pearson correlation coefficient is ubiquitously deployed across application domains and diverse scientific areas. 9 | Thus, even minor and significant improvements in these techniques could have enormous consequences in industry and research. 10 | 11 | 12 | In transcriptomics, many analyses start with estimating the correlation between genes. 
13 | More sophisticated approaches built on correlation analysis can suggest gene function [@pmid:21241896], aid in discovering common and cell lineage-specific regulatory networks [@pmid:25915600], and capture important interactions in a living organism that can uncover molecular mechanisms in other species [@pmid:21606319; @pmid:16968540]. 14 | The analysis of large RNA-seq datasets [@pmid:32913098; @pmid:34844637] can also reveal complex transcriptional mechanisms underlying human diseases [@pmid:27479844; @pmid:31121115; @pmid:30668570; @pmid:32424349; @pmid:34475573]. 15 | Since the introduction of the omnigenic model of complex traits [@pmid:28622505; @pmid:31051098], gene-gene relationships are playing an increasingly important role in genetic studies of human diseases [@pmid:34845454; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342; @doi:10.1038/s41588-021-00913-z], even in specific fields such as polygenic risk scores [@doi:10.1016/j.ajhg.2021.07.003]. 16 | In this context, recent approaches combine disease-associated genes from genome-wide association studies (GWAS) with gene co-expression networks to prioritize "core" genes directly affecting diseases [@doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342]. 17 | These core genes are not captured by standard statistical methods but are believed to be part of highly-interconnected, disease-relevant regulatory networks. 18 | Therefore, advanced correlation coefficients could immediately find wide applications across many areas of biology, including the prioritization of candidate drug targets in the precision medicine field. 19 | 20 | 21 | The Pearson and Spearman correlation coefficients are widely used because they reveal intuitive relationships and can be computed quickly. 22 | However, they are designed to capture linear or monotonic patterns (referred to as linear-only) and may miss complex yet critical relationships. 
23 | Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505]. 24 | MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077]. 25 | However, the computational complexity makes them impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001]. 26 | Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions [@pmid:33972855]. 27 | We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899]. 28 | Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear coefficient that works across quantitative and qualitative variables. 29 | CCC has a single parameter that limits the maximum complexity of relationships found (from linear to more general patterns) and computation time. 30 | CCC provides a high level of flexibility to detect specific types of patterns that are more important for the user, while providing safe defaults to capture general relationships. 31 | We also provide an efficient CCC implementation that is highly parallelizable, allowing to speed up computation across variable pairs with millions of objects or conditions. 32 | To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues [@doi:10.1126/science.aaz1776]. 
33 | CCC captured both strong linear relationships and novel nonlinear patterns, which were entirely missed by standard coefficients. 34 | For example, some of these nonlinear patterns were associated with sex differences in gene expression, suggesting that CCC can capture strong relationships present only in a subset of samples. 35 | We also found that the CCC behaves similarly to MIC in several cases, although it is much faster to compute. 36 | Gene pairs detected in expression data by CCC had higher interaction probabilities in tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@doi:10.1038/ng.3259]. 37 | Furthermore, its ability to efficiently handle diverse data types (including numerical and categorical features) reduces preprocessing steps and makes it appealing to analyze large and heterogeneous repositories. 38 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc_non_standard_filenames/02.beginning.md: -------------------------------------------------------------------------------- 1 | ## Introduction 2 | 3 | New technologies have vastly improved data collection, generating a deluge of information across different disciplines. 4 | This large amount of data provides new opportunities to address unanswered scientific questions, provided we have efficient tools capable of identifying multiple types of underlying patterns. 5 | Correlation analysis is an essential statistical technique for discovering relationships between variables [@pmid:21310971]. 6 | Correlation coefficients are often used in exploratory data mining techniques, such as clustering or community detection algorithms, to compute a similarity value between a pair of objects of interest such as genes [@pmid:27479844] or disease-relevant lifestyle factors [@doi:10.1073/pnas.1217269109]. 
7 | Correlation methods are also used in supervised tasks, for example, for feature selection to improve prediction accuracy [@pmid:27006077; @pmid:33729976]. 8 | The Pearson correlation coefficient is ubiquitously deployed across application domains and diverse scientific areas. 9 | Thus, even minor and significant improvements in these techniques could have enormous consequences in industry and research. 10 | 11 | 12 | In transcriptomics, many analyses start with estimating the correlation between genes. 13 | More sophisticated approaches built on correlation analysis can suggest gene function [@pmid:21241896], aid in discovering common and cell lineage-specific regulatory networks [@pmid:25915600], and capture important interactions in a living organism that can uncover molecular mechanisms in other species [@pmid:21606319; @pmid:16968540]. 14 | The analysis of large RNA-seq datasets [@pmid:32913098; @pmid:34844637] can also reveal complex transcriptional mechanisms underlying human diseases [@pmid:27479844; @pmid:31121115; @pmid:30668570; @pmid:32424349; @pmid:34475573]. 15 | Since the introduction of the omnigenic model of complex traits [@pmid:28622505; @pmid:31051098], gene-gene relationships are playing an increasingly important role in genetic studies of human diseases [@pmid:34845454; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342; @doi:10.1038/s41588-021-00913-z], even in specific fields such as polygenic risk scores [@doi:10.1016/j.ajhg.2021.07.003]. 16 | In this context, recent approaches combine disease-associated genes from genome-wide association studies (GWAS) with gene co-expression networks to prioritize "core" genes directly affecting diseases [@doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342]. 17 | These core genes are not captured by standard statistical methods but are believed to be part of highly-interconnected, disease-relevant regulatory networks. 
18 | Therefore, advanced correlation coefficients could immediately find wide applications across many areas of biology, including the prioritization of candidate drug targets in the precision medicine field. 19 | 20 | 21 | The Pearson and Spearman correlation coefficients are widely used because they reveal intuitive relationships and can be computed quickly. 22 | However, they are designed to capture linear or monotonic patterns (referred to as linear-only) and may miss complex yet critical relationships. 23 | Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505]. 24 | MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077]. 25 | However, the computational complexity makes them impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001]. 26 | Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions [@pmid:33972855]. 27 | We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899]. 28 | Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear coefficient that works across quantitative and qualitative variables. 29 | CCC has a single parameter that limits the maximum complexity of relationships found (from linear to more general patterns) and computation time. 
30 | CCC provides a high level of flexibility to detect specific types of patterns that are more important for the user, while providing safe defaults to capture general relationships. 31 | We also provide an efficient CCC implementation that is highly parallelizable, which speeds up computation across variable pairs with millions of objects or conditions. 32 | To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues [@doi:10.1126/science.aaz1776]. 33 | CCC captured both strong linear relationships and novel nonlinear patterns, which were entirely missed by standard coefficients. 34 | For example, some of these nonlinear patterns were associated with sex differences in gene expression, suggesting that CCC can capture strong relationships present only in a subset of samples. 35 | We also found that the CCC behaves similarly to MIC in several cases, although it is much faster to compute. 36 | Gene pairs detected in expression data by CCC had higher interaction probabilities in tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@doi:10.1038/ng.3259]. 37 | Furthermore, its ability to efficiently handle diverse data types (including numerical and categorical features) reduces preprocessing steps and makes it appealing for analyzing large and heterogeneous repositories. 38 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full/content/metadata.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Projecting genetic associations through gene expression patterns highlights disease etiology and drug mechanisms" 3 | date: 2023-09-09 # Defaults to date generated, but can specify like '2022-10-31'.
4 | keywords: 5 | - genetic studies 6 | - functional genomics 7 | - gene co-expression 8 | - therapeutic targets 9 | - drug repurposing 10 | - clustering of complex traits 11 | lang: en-US 12 | authors: 13 | - name: Milton Pividori 14 | github: miltondp 15 | initials: MP 16 | orcid: 0000-0002-3035-4403 17 | twitter: miltondp 18 | mastodon: miltondp 19 | mastodon-server: genomic.social 20 | email: milton.pividori@cuanschutz.edu 21 | affiliations: 22 | - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA 23 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 24 | funders: 25 | - The Gordon and Betty Moore Foundation GBMF 4552 26 | - The National Human Genome Research Institute (R01 HG010067) 27 | - The National Human Genome Research Institute (K99HG011898) 28 | - The Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765) 29 | 30 | - name: Sumei Lu 31 | affiliations: 32 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA 33 | 34 | - name: Binglan Li 35 | orcid: 0000-0002-0103-6107 36 | affiliations: 37 | - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA. 38 | 39 | - name: Chun Su 40 | orcid: 0000-0001-6388-8666 41 | github: sckinta 42 | affiliations: 43 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA 44 | 45 | - name: Matthew E. 
Johnson 46 | affiliations: 47 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA 48 | 49 | - name: Wei-Qi Wei 50 | affiliations: 51 | - Vanderbilt University Medical Center, Nashville, TN 37232, USA 52 | 53 | - name: Qiping Feng 54 | orcid: 0000-0002-6213-793X 55 | affiliations: 56 | - Vanderbilt University Medical Center, Nashville, TN 37232, USA 57 | 58 | - name: Bahram Namjou 59 | affiliations: 60 | - Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA 61 | 62 | - name: Krzysztof Kiryluk 63 | orcid: 0000-0002-5047-6715 64 | twitter: kirylukk 65 | affiliations: 66 | - Department of Medicine, Division of Nephrology, Vagelos College of Physicians \& Surgeons, Columbia University, New York, NY 10032, USA 67 | 68 | - name: Iftikhar Kullo 69 | affiliations: 70 | - Mayo Clinic, Rochester, MN 55905, USA 71 | 72 | - name: Yuan Luo 73 | orcid: 0000-0003-0195-7456 74 | affiliations: 75 | - Northwestern University, Chicago, IL 60611, USA 76 | 77 | - name: Blair D. Sullivan 78 | github: bdsullivan 79 | orcid: 0000-0001-7720-6208 80 | twitter: blairdsullivan 81 | affiliations: 82 | - Kahlert School of Computing, University of Utah, Salt Lake City, UT 84112, USA 83 | 84 | - name: Benjamin F. 
Voight 85 | orcid: 0000-0002-6205-9994 86 | twitter: bvoight28 87 | github: bvoight 88 | affiliations: 89 | - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 90 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 91 | - Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 92 | 93 | - name: Carsten Skarke 94 | orcid: 0000-0001-5145-3681 95 | twitter: CarstenSkarke 96 | affiliations: 97 | - Institute for Translational Medicine and Therapeutics, Department of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 98 | 99 | - name: Marylyn D. Ritchie 100 | initials: MDR 101 | orcid: 0000-0002-1208-1720 102 | twitter: MarylynRitchie 103 | email: marylyn@pennmedicine.upenn.edu 104 | affiliations: 105 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 106 | 107 | - name: Struan F.A. Grant 108 | email: grants@chop.edu 109 | orcid: 0000-0003-2025-5302 110 | twitter: STRUANGRANT 111 | affiliations: 112 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA 113 | - Division of Endocrinology and Diabetes, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 114 | - Division of Human Genetics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 115 | - Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA 116 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA 117 | 118 | - name: Casey S. 
Greene 119 | github: cgreene 120 | initials: CSG 121 | orcid: 0000-0001-8713-9213 122 | twitter: GreeneScientist 123 | mastodon: greenescientist 124 | mastodon-server: genomic.social 125 | email: casey.s.greene@cuanschutz.edu 126 | affiliations: 127 | - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA 128 | - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA 129 | funders: 130 | - The Gordon and Betty Moore Foundation (GBMF 4552) 131 | - The National Human Genome Research Institute (R01 HG010067) 132 | - The National Cancer Institute (R01 CA237170) 133 | - The Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765) 134 | corresponding: true 135 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full_only_first_para/content/metadata.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Projecting genetic associations through gene expression patterns highlights disease etiology and drug mechanisms" 3 | date: 2023-09-09 # Defaults to date generated, but can specify like '2022-10-31'. 
4 | keywords: 5 | - genetic studies 6 | - functional genomics 7 | - gene co-expression 8 | - therapeutic targets 9 | - drug repurposing 10 | - clustering of complex traits 11 | lang: en-US 12 | authors: 13 | - name: Milton Pividori 14 | github: miltondp 15 | initials: MP 16 | orcid: 0000-0002-3035-4403 17 | twitter: miltondp 18 | mastodon: miltondp 19 | mastodon-server: genomic.social 20 | email: milton.pividori@cuanschutz.edu 21 | affiliations: 22 | - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA 23 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 24 | funders: 25 | - The Gordon and Betty Moore Foundation GBMF 4552 26 | - The National Human Genome Research Institute (R01 HG010067) 27 | - The National Human Genome Research Institute (K99HG011898) 28 | - The Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765) 29 | 30 | - name: Sumei Lu 31 | affiliations: 32 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA 33 | 34 | - name: Binglan Li 35 | orcid: 0000-0002-0103-6107 36 | affiliations: 37 | - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA. 38 | 39 | - name: Chun Su 40 | orcid: 0000-0001-6388-8666 41 | github: sckinta 42 | affiliations: 43 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA 44 | 45 | - name: Matthew E. 
Johnson 46 | affiliations: 47 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA 48 | 49 | - name: Wei-Qi Wei 50 | affiliations: 51 | - Vanderbilt University Medical Center, Nashville, TN 37232, USA 52 | 53 | - name: Qiping Feng 54 | orcid: 0000-0002-6213-793X 55 | affiliations: 56 | - Vanderbilt University Medical Center, Nashville, TN 37232, USA 57 | 58 | - name: Bahram Namjou 59 | affiliations: 60 | - Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA 61 | 62 | - name: Krzysztof Kiryluk 63 | orcid: 0000-0002-5047-6715 64 | twitter: kirylukk 65 | affiliations: 66 | - Department of Medicine, Division of Nephrology, Vagelos College of Physicians \& Surgeons, Columbia University, New York, NY 10032, USA 67 | 68 | - name: Iftikhar Kullo 69 | affiliations: 70 | - Mayo Clinic, Rochester, MN 55905, USA 71 | 72 | - name: Yuan Luo 73 | orcid: 0000-0003-0195-7456 74 | affiliations: 75 | - Northwestern University, Chicago, IL 60611, USA 76 | 77 | - name: Blair D. Sullivan 78 | github: bdsullivan 79 | orcid: 0000-0001-7720-6208 80 | twitter: blairdsullivan 81 | affiliations: 82 | - Kahlert School of Computing, University of Utah, Salt Lake City, UT 84112, USA 83 | 84 | - name: Benjamin F. 
Voight 85 | orcid: 0000-0002-6205-9994 86 | twitter: bvoight28 87 | github: bvoight 88 | affiliations: 89 | - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 90 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 91 | - Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 92 | 93 | - name: Carsten Skarke 94 | orcid: 0000-0001-5145-3681 95 | twitter: CarstenSkarke 96 | affiliations: 97 | - Institute for Translational Medicine and Therapeutics, Department of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 98 | 99 | - name: Marylyn D. Ritchie 100 | initials: MDR 101 | orcid: 0000-0002-1208-1720 102 | twitter: MarylynRitchie 103 | email: marylyn@pennmedicine.upenn.edu 104 | affiliations: 105 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA 106 | 107 | - name: Struan F.A. Grant 108 | email: grants@chop.edu 109 | orcid: 0000-0003-2025-5302 110 | twitter: STRUANGRANT 111 | affiliations: 112 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA 113 | - Division of Endocrinology and Diabetes, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 114 | - Division of Human Genetics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 115 | - Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA 116 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA 117 | 118 | - name: Casey S. 
Greene 119 | github: cgreene 120 | initials: CSG 121 | orcid: 0000-0001-8713-9213 122 | twitter: GreeneScientist 123 | mastodon: greenescientist 124 | mastodon-server: genomic.social 125 | email: casey.s.greene@cuanschutz.edu 126 | affiliations: 127 | - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA 128 | - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA 129 | funders: 130 | - The Gordon and Betty Moore Foundation (GBMF 4552) 131 | - The National Human Genome Research Institute (R01 HG010067) 132 | - The National Cancer Institute (R01 CA237170) 133 | - The Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765) 134 | corresponding: true 135 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Contributor Covenant Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | We as members, contributors, and leaders pledge to make participation in our 7 | community a harassment-free experience for everyone, regardless of age, body 8 | size, visible or invisible disability, ethnicity, sex characteristics, gender 9 | identity and expression, level of experience, education, socio-economic status, 10 | nationality, personal appearance, race, caste, color, religion, or sexual 11 | identity and orientation. 12 | 13 | We pledge to act and interact in ways that contribute to an open, welcoming, 14 | diverse, inclusive, and healthy community. 
15 | 16 | ## Our Standards 17 | 18 | Examples of behavior that contributes to a positive environment for our 19 | community include: 20 | 21 | * Demonstrating empathy and kindness toward other people 22 | * Being respectful of differing opinions, viewpoints, and experiences 23 | * Giving and gracefully accepting constructive feedback 24 | * Accepting responsibility and apologizing to those affected by our mistakes, 25 | and learning from the experience 26 | * Focusing on what is best not just for us as individuals, but for the overall 27 | community 28 | 29 | Examples of unacceptable behavior include: 30 | 31 | * The use of sexualized language or imagery, and sexual attention or advances of 32 | any kind 33 | * Trolling, insulting or derogatory comments, and personal or political attacks 34 | * Public or private harassment 35 | * Publishing others' private information, such as a physical or email address, 36 | without their explicit permission 37 | * Other conduct which could reasonably be considered inappropriate in a 38 | professional setting 39 | 40 | ## Enforcement Responsibilities 41 | 42 | Community leaders are responsible for clarifying and enforcing our standards of 43 | acceptable behavior and will take appropriate and fair corrective action in 44 | response to any behavior that they deem inappropriate, threatening, offensive, 45 | or harmful. 46 | 47 | Community leaders have the right and responsibility to remove, edit, or reject 48 | comments, commits, code, wiki edits, issues, and other contributions that are 49 | not aligned to this Code of Conduct, and will communicate reasons for moderation 50 | decisions when appropriate. 51 | 52 | ## Scope 53 | 54 | This Code of Conduct applies within all community spaces, and also applies when 55 | an individual is officially representing the community in public spaces. 
56 | Examples of representing our community include using an official email address, 57 | posting via an official social media account, or acting as an appointed 58 | representative at an online or offline event. 59 | 60 | ## Enforcement 61 | 62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 63 | reported to the community leaders responsible for enforcement at 64 | [milton.pividori@cuanschutz.edu](mailto:milton.pividori@cuanschutz.edu). 65 | All complaints will be reviewed and investigated promptly and fairly. 66 | 67 | All community leaders are obligated to respect the privacy and security of the 68 | reporter of any incident. 69 | 70 | ## Enforcement Guidelines 71 | 72 | Community leaders will follow these Community Impact Guidelines in determining 73 | the consequences for any action they deem in violation of this Code of Conduct: 74 | 75 | ### 1. Correction 76 | 77 | **Community Impact**: Use of inappropriate language or other behavior deemed 78 | unprofessional or unwelcome in the community. 79 | 80 | **Consequence**: A private, written warning from community leaders, providing 81 | clarity around the nature of the violation and an explanation of why the 82 | behavior was inappropriate. A public apology may be requested. 83 | 84 | ### 2. Warning 85 | 86 | **Community Impact**: A violation through a single incident or series of 87 | actions. 88 | 89 | **Consequence**: A warning with consequences for continued behavior. No 90 | interaction with the people involved, including unsolicited interaction with 91 | those enforcing the Code of Conduct, for a specified period of time. This 92 | includes avoiding interactions in community spaces as well as external channels 93 | like social media. Violating these terms may lead to a temporary or permanent 94 | ban. 95 | 96 | ### 3. Temporary Ban 97 | 98 | **Community Impact**: A serious violation of community standards, including 99 | sustained inappropriate behavior. 
100 | 101 | **Consequence**: A temporary ban from any sort of interaction or public 102 | communication with the community for a specified period of time. No public or 103 | private interaction with the people involved, including unsolicited interaction 104 | with those enforcing the Code of Conduct, is allowed during this period. 105 | Violating these terms may lead to a permanent ban. 106 | 107 | ### 4. Permanent Ban 108 | 109 | **Community Impact**: Demonstrating a pattern of violation of community 110 | standards, including sustained inappropriate behavior, harassment of an 111 | individual, or aggression toward or disparagement of classes of individuals. 112 | 113 | **Consequence**: A permanent ban from any sort of public interaction within the 114 | community. 115 | 116 | ## Attribution 117 | 118 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 119 | version 2.1, available at 120 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 121 | 122 | Community Impact Guidelines were inspired by 123 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 124 | 125 | For answers to common questions about this code of conduct, see the FAQ at 126 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at 127 | [https://www.contributor-covenant.org/translations][translations]. 
128 | 129 | [homepage]: https://www.contributor-covenant.org 130 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html 131 | [Mozilla CoC]: https://github.com/mozilla/diversity 132 | [FAQ]: https://www.contributor-covenant.org/faq 133 | [translations]: https://www.contributor-covenant.org/translations 134 | -------------------------------------------------------------------------------- /tests/manuscripts/mutator-epistasis/02.introduction.md: -------------------------------------------------------------------------------- 1 | ## Introduction 2 | 3 | Germline mutation rates reflect the complex interplay between DNA proofreading and repair pathways, exogenous sources of DNA damage, and life-history traits. 4 | For example, parental age is an important determinant of mutation rate variability; in many mammalian species, the number of germline *de novo* mutations observed in offspring increases as a function of paternal and maternal age [@PMID:28959963;@PMID:31549960;@PMID:35771663;@PMID:32804933;@PMID:31492841]. 5 | Rates of germline mutation accumulation are also variable across human families [@PMID:26656846;@PMID:31549960], likely due to either genetic variation or differences in environmental exposures. 6 | Although numerous protein-coding genes contribute to the maintenance of genome integrity, genetic variants that increase germline mutation rates, known as *mutator alleles*, have proven difficult to discover in mammals. 7 | 8 | The dearth of observed germline mutators in mammalian genomes is not necessarily surprising, since alleles that lead to elevated germline mutation rates would likely have deleterious consequences and be purged by negative selection if their effect sizes are large [@PMID:27739533]. 9 | Moreover, germline mutation rates are relatively low, and direct mutation rate measurements require whole-genome sequencing data from both parents and their offspring. 
10 | As a result, large-scale association studies — which have been used to map the contributions of common genetic variants to many complex traits — are not currently well-powered to investigate the polygenic architecture of germline mutation rates [@PMID:31964835]. 11 | 12 | Despite these challenges, less traditional strategies have been used to identify a small number of mutator alleles in humans, macaques [@doi:10.1101/2023.03.27.534460], and mice. 13 | By focusing on families with rare genetic diseases, a recent study discovered two mutator alleles that led to significantly elevated rates of *de novo* germline mutation in human genomes [@PMID:35545669]. 14 | Other groups have observed mutator phenotypes in the germlines and somatic tissues of adults who carry cancer-predisposing inherited mutations in the POLE/POLD1 exonucleases [@PMID:34594041;@PMID:37336879]. 15 | Candidate mutator loci were also found by identifying human haplotypes from the Thousand Genomes Project with excess counts of derived alleles in genomic windows [@PMID:28095480]. 16 | 17 | In mice, a germline mutator allele was recently discovered by sequencing a large family of inbred mice [@PMID:35545679]. 18 | Commonly known as the BXDs, these recombinant inbred lines (RILs) were derived from either F2 or advanced intercrosses of C57BL/6J and DBA/2J, two laboratory strains that exhibit significant differences in their germline mutation spectra [@PMID:33472028;@PMID:30753674]. 19 | The BXDs were maintained via brother-sister mating for up to 180 generations, and each BXD therefore accumulated hundreds or thousands of germline mutations on a nearly-homozygous linear mosaic of parental B and D haplotypes. 20 | Due to their husbandry in a controlled laboratory setting, the BXDs were largely free from confounding by environmental heterogeneity, and the effects of selection on *de novo* mutations were attenuated by strict inbreeding [@doi:10.1146/annurev.ecolsys.39.110707.173437]. 
21 | 22 | In this previous study, whole-genome sequencing data from the BXD family were used to map a quantitative trait locus (QTL) for the C>A mutation rate [@PMID:35545679]. 23 | Germline C>A mutation rates were nearly 50% higher in mice with *D* haplotypes at the QTL, likely due to genetic variation in the DNA glycosylase *Mutyh* that reduced the efficacy of oxidative DNA damage repair. 24 | Pathogenic variants of *Mutyh* also appear to act as mutators in normal human germline and somatic tissues [@PMID:35803914;@PMID:30753674]. 25 | Importantly, the QTL did not reach genome-wide significance in a scan for variation in overall germline mutation rates, which were only modestly higher in BXDs with *D* alleles, demonstrating the utility of mutation spectrum analysis for mutator allele discovery. 26 | Close examination of the mutation spectrum is likely to be broadly useful for detecting mutator alleles, as genes involved in DNA proofreading and repair often recognize particular sequence motifs or excise specific types of DNA lesions [@PMID:32619789]. 27 | Mutation spectra are usually defined in terms of $k$-mer nucleotide context; the 1-mer mutation spectrum, for example, consists of 6 mutation types after collapsing by strand complement (C>T, C>A, C>G, A>T, A>C, A>G), while the 3-mer mutation spectrum contains 96 (each of the 1-mer mutations partitioned by trinucleotide context). 28 | 29 | Although mutation spectrum analysis can enable the discovery of mutator alleles that affect the rates of specific mutation types, early implementations of this strategy have suffered from a few drawbacks. 30 | For example, performing association tests on the rates or fractions of every $k$-mer mutation type can quickly incur a substantial multiple testing burden. 31 | Since germline mutation rates are generally quite low, estimates of $k$-mer mutation type frequencies from individual samples can also be noisy and imprecise. 
32 | Moreover, inbreeding duration can vary considerably across samples in populations of RILs; for example, some BXDs were inbred for only 20 generations, while others were inbred for nearly 200. 33 | As a result, the variance of individual $k$-mer mutation rate estimates in those populations will be much higher than if all samples were inbred for the same duration. 34 | We were therefore motivated to develop a statistical method that could overcome the sparsity of *de novo* mutation spectra, eliminate the need to test each $k$-mer mutation type separately, and enable sensitive detection of alleles that influence the germline mutation spectrum. 35 | 36 | Here, we present a new mutation spectrum association test, called "aggregate mutation spectrum distance," that minimizes multiple testing burdens and mitigates the challenges of sparsity in *de novo* mutation datasets. 37 | We leverage this method to re-analyze germline mutation data from the BXD family and find compelling evidence for a second mutator allele that was not detected using previous approaches. 38 | The new allele appears to interact epistatically with the mutator that was previously discovered in the BXDs, further augmenting the C>A germline mutation rate in a subset of inbred mice. 39 | Our observation of epistasis suggests that mild DNA repair deficiencies can compound one another, as mutator alleles chip away at the redundant systems that collectively maintain germline integrity. 40 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc/04.10.results_comp.md: -------------------------------------------------------------------------------- 1 | ### The CCC reveals linear and nonlinear patterns in human transcriptomic data 2 | 3 | We next examined the characteristics of these correlation coefficients in gene expression data from GTEx v8 across different tissues. 
4 | We selected the top 5,000 genes with the largest variance for our initial analyses on whole blood and then computed the correlation matrix between genes using Pearson, Spearman and CCC (see [Methods](#sec:data_gtex)). 5 | 6 | 7 | We examined the distribution of each coefficient's absolute values in GTEx (Figure @fig:dist_coefs). 8 | CCC (mean=0.14, median=0.08, sd=0.15) has a much more skewed distribution than Pearson (mean=0.31, median=0.24, sd=0.24) and Spearman (mean=0.39, median=0.37, sd=0.26). 9 | The coefficients reach a cumulative set containing 70% of gene pairs at different values (Figure @fig:dist_coefs b), $c=0.18$, $p=0.44$ and $s=0.56$, suggesting that for this type of data, the coefficients are not directly comparable by magnitude, so we used ranks for further comparisons. 10 | In GTEx v8, CCC values were closer to Spearman and vice versa than either was to Pearson (Figure @fig:dist_coefs c). 11 | We also compared CCC with the Maximal Information Coefficient (MIC) on this data (see [Supplementary Note 1](#sec:mic)). 12 | We found that CCC behaved very similarly to MIC, although CCC was up to two orders of magnitude faster to run (see [Supplementary Note 2](#sec:time_test)). 13 | MIC, an advanced correlation coefficient able to capture general patterns beyond linear relationships, represented a significant step forward in correlation analysis research and has been successfully used in various application domains [@pmid:33972855; @pmid:33001806; @pmid:27006077]. 14 | These results suggest that our findings for CCC generalize to MIC; therefore, in the subsequent analyses, we focus on CCC and linear-only coefficients. 15 | 16 | 17 | ![ 18 | **Distribution of coefficient values on gene expression (GTEx v8, whole blood).** 19 | **a)** Histogram of coefficient values. 20 | **b)** Corresponding cumulative histogram. The dotted line maps the coefficient value that accumulates 70% of gene pairs.
21 | **c)** 2D histogram plot with hexagonal bins between all coefficients, where a logarithmic scale was used to color each hexagon. 22 | ](images/coefs_comp/gtex_whole_blood/dist-main.svg "Distribution of coefficient values"){#fig:dist_coefs width="100%"} 23 | 24 | 25 | A closer inspection of gene pairs that were either prioritized or disregarded by these coefficients revealed that they captured different patterns. 26 | We analyzed the agreements and disagreements by obtaining, for each coefficient, the top 30% of gene pairs with the largest correlation values ("high" set) and the bottom 30% ("low" set), resulting in six potentially overlapping categories. 27 | For most cases (76.4%), an UpSet analysis [@doi:10.1109/TVCG.2014.2346248] (Figure @fig:upsetplot_coefs a) showed that the three coefficients agreed on whether there is a strong correlation (42.1%) or there is no relationship (34.3%). 28 | Since Pearson and Spearman are linear-only, and CCC can also capture these patterns, we expect that these concordant gene pairs represent clear linear patterns. 29 | CCC and Spearman agree more on either highly or poorly correlated pairs (4.0% in "high", and 7.0% in "low") than any of these with Pearson (all between 0.3%-3.5% for "high", and 2.8%-5.5% for "low"). 30 | In summary, CCC agrees with either Pearson or Spearman in 90.5% of gene pairs by assigning a high or a low correlation value. 31 | 32 | ![ 33 | **Intersection of gene pairs with high and low correlation coefficient values (GTEx v8, whole blood).** 34 | **a)** UpSet plot with six categories (rows) grouping the 30% of the highest (green triangle) and lowest (red triangle) values for each coefficient. 35 | Columns show different intersections of categories grouped by agreements and disagreements. 36 | **b)** Hexagonal binning plots with examples of gene pairs where CCC ($c$) disagrees with Pearson ($p$) and Spearman ($s$). 
37 | For each method, colors in the triangles indicate if the gene pair is among the top (green) or bottom (red) 30% of coefficient values. 38 | No triangle means that the correlation value for the gene pair is between the 30th and 70th percentiles (neither low nor high). 39 | A logarithmic scale was used to color each hexagon. 40 | ](images/coefs_comp/gtex_whole_blood/upsetplot-main.svg "Intersection of gene pairs"){#fig:upsetplot_coefs width="100%"} 41 | 42 | 43 | While there was broad agreement, more than 20,000 gene pairs with a high CCC value were not highly ranked by the other coefficients (right part of Figure @fig:upsetplot_coefs a). 44 | There were also gene pairs with a high Pearson value and either low CCC (1,075), low Spearman (87) or both low CCC and low Spearman values (531). 45 | However, our examination suggests that many of these cases appear to be driven by potential outliers (Figure @fig:upsetplot_coefs b, and analyzed later). 46 | We analyzed gene pairs among the top five of each intersection in the "Disagreements" group (Figure @fig:upsetplot_coefs a, right) where CCC disagrees with Pearson, Spearman or both. 47 | 48 | ![ 49 | **The expression levels of *KDM6A* and *UTY* display sex-specific associations across GTEx tissues.** 50 | CCC captures this nonlinear relationship in all GTEx tissues (nine examples are shown in the first three rows), except in female-specific organs (last row). 51 | ](images/coefs_comp/kdm6a_vs_uty/gtex-KDM6A_vs_UTY-main.svg "KDM6A and UTY across different GTEx tissues"){#fig:gtex_tissues:kdm6a_uty width="95%"} 52 | 53 | The first three gene pairs at the top (*IFNG* - *SDS*, *JUN* - *APOC1*, and *ZDHHC12* - *CCL18*), with high CCC and low Pearson values, appear to follow a non-coexistence relationship: in samples where one of the genes is highly (slightly) expressed, the other is slightly (highly) activated, suggesting a potentially inhibiting effect. 
54 | The following three gene pairs (*UTY* - *KDM6A*, *RASSF2* - *CYTIP*, and *AC068580.6* - *KLHL21*) follow patterns combining either two linear or one linear and one independent relationships. 55 | In particular, genes *UTY* and *KDM6A* (paralogs) show a nonlinear relationship where a subset of samples follows a robust linear pattern and another subset has a constant (independent) expression of one gene. 56 | This relationship is explained by the fact that *UTY* is in chromosome Y (Yq11) whereas *KDM6A* is in chromosome X (Xp11), and samples with a linear pattern are males, whereas those with no expression for *UTY* are females. 57 | This combination of linear and independent patterns is captured by CCC ($c=0.29$, above the 80th percentile) but not by Pearson ($p=0.24$, below the 55th percentile) or Spearman ($s=0.10$, below the 15th percentile). 58 | Furthermore, the same gene pair pattern is highly ranked by CCC in all other tissues in GTEx, except for female-specific organs (Figure @fig:gtex_tissues:kdm6a_uty). 59 | -------------------------------------------------------------------------------- /docs/custom-prompts.md: -------------------------------------------------------------------------------- 1 | # Custom Prompts 2 | 3 | Rather than using the default prompt, you can specify custom prompts for each file in your manuscript. 4 | This can be useful when you want specific sections of your manuscript to be revised in specific ways, or not revised at all. 5 | 6 | There are two ways that you can use the custom prompts system: 7 | 1. You can define your prompts and how they map to your manuscript files in a single file, `ai-revision-prompts.yaml`. 8 | 2. You can create the `ai-revision-prompts.yaml`, but only specify prompts and identifiers, which makes it suitable for sharing with others who have different names for their manuscripts' files. 
9 | You would then specify a second file, `ai-revision-config.yaml`, that maps the prompt identifiers to the actual files in your manuscript. 10 | 11 | These files should be placed in the `ci` directory under your manubot root directory. 12 | 13 | See [Functionality Notes](#functionality-notes) later in this document for more information on how to write regular expressions and use placeholders in your prompts. 14 | 15 | 16 | ## Approach 1: Single file 17 | 18 | With this approach, you can define your prompts and how they map to your manuscript files in a single file. 19 | The single file should be named `ai-revision-prompts.yaml` and placed in the `ci` folder. 20 | 21 | The file would look something like the following: 22 | 23 | ```yaml 24 | prompts_files: 25 | # filenames are specified as regular expressions 26 | # in this case, we match a file named exactly 'filename.md' 27 | ^filename\.md$: "Prompt text here" 28 | 29 | # you can use YAML's multi-line string syntax to write longer prompts 30 | # you can also use {placeholders} to include metadata from your manuscript 31 | ^filename\.md$: | 32 | Revise the following paragraph from a manuscript titled {title} 33 | so that it sounds like an academic paper. 34 | 35 | # specifying the special value 'null' will skip revising any files that 36 | # match this regular expression 37 | ^ignore_this_file\.md$: null 38 | ``` 39 | 40 | Note that, for each file, the first matching regular expression will determine its prompt or whether the file is skipped. 41 | Even if a file matches multiple regexes, only the first one will be used. 42 | 43 | 44 | ## Approach 2: Prompt file plus configuration file 45 | 46 | In this case, we specify two files, `ai-revision-prompts.yaml` and `ai-revision-config.yaml`. 47 | 48 | The `ai-revision-prompts.yaml` file contains only the prompts and their identifiers. 
49 | The top-level element is `prompts` in this case rather than `prompts_files`, as it defines a set of reusable prompts and not prompt-file mappings. 50 | 51 | Here's an example of what the `ai-revision-prompts.yaml` file might look like: 52 | ```yaml 53 | prompts: 54 | intro_prompt: "Prompt text here" 55 | content_prompts: | 56 | Revise the following paragraph from a manuscript titled {title} 57 | so that it sounds like an academic paper. 58 | 59 | my_default: "Revise this paragraph so it sounds nicer." 60 | ``` 61 | 62 | The `ai-revision-config.yaml` file maps the prompt identifiers to the actual files in your manuscript. 63 | 64 | An example of the `ai-revision-config.yaml` file: 65 | ```yaml 66 | files: 67 | matchings: 68 | - files: 69 | - ^introduction\.md$ 70 | prompt: intro_prompt 71 | - files: 72 | - ^methods\.md$ 73 | - ^abstract\.md$ 74 | prompt: content_prompts 75 | 76 | # the special value default_prompt is used when no other regex matches 77 | # it also uses a prompt identifier taken from ai-revision-prompts.yaml 78 | default_prompt: my_default 79 | 80 | # any file you want to be skipped can be specified in this list 81 | ignores: 82 | - ^ignore_this_file\.md$ 83 | ``` 84 | 85 | Multiple regexes can be specified in a list under `files` to match multiple files to a single prompt. 86 | 87 | In this case, the `default_prompt` is used when no other regex matches, and it uses a prompt identifier taken from `ai-revision-prompts.yaml`. 88 | 89 | The `ignores` list specifies files that should be skipped entirely during the revision process; they won't have the default prompt applied to them. 90 | 91 | 92 | ## Functionality Notes 93 | 94 | ### Filenames as Regular Expressions 95 | 96 | Filenames in either approach are specified as regular expressions (aka "regexes"). 97 | This allows you to flexibly match multiple files to a prompt with a single expression.
98 | 99 | A simple example: to specify an exact match for, say, `myfile.md`, you'd supply the regular expression `^myfile\.md$`, where: 100 | - `^` matches the beginning of the filename 101 | - `\.` matches a literal period -- otherwise, `.` means "any character" 102 | - `$` matches the end of the filename 103 | 104 | To illustrate why that syntax is important: if you were to write it as `myfile.md`, the `.` would match any character, so it would match `myfileAmd`, `myfile2md`, etc. 105 | Without the `^` and `$`, it would also match filenames like `asdf_myfile.md`, `myfile.md_asdf`, and `asdf_myfile.md.txt`. 106 | 107 | The benefit of using regexes becomes more apparent when you have multiple files. 108 | For example, say you had three files, `02.prior-work.md`, `02.methods.md`, and `02.results.md`. To match all of these, you could use the expression `^02\..*\.md$`. 109 | This would match any file beginning with `02.` and ending with `.md`. 110 | Here, `.` again indicates "any character" and the `*` means "zero or more of the preceding character"; together, they match any sequence of characters. 111 | 112 | You can find more information on how to write regular expressions in [Python's `re` module documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax). 113 | 114 | 115 | ### Placeholders 116 | 117 | The prompt text can include metadata from your manuscript, specified in `content/metadata.yaml` in Manubot. Writing 118 | `{placeholder}` into your prompt text will cause it to be replaced with the corresponding value, drawn either 119 | from the manuscript metadata or from the current file/paragraph being revised.
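To make the two mechanisms described so far more concrete, the following is a minimal sketch of how a prompt could be chosen by the first matching filename regex and then filled in with placeholder values. It is not the tool's actual implementation: the `PROMPTS_FILES` table and the `resolve_prompt` function are hypothetical names, the use of `re.search` and `str.format` is an assumption, and the real tool may handle unmatched files or unknown placeholders differently.

```python
import re

# Hypothetical table in the style of `prompts_files`; order matters because
# the first matching regex decides the outcome. `None` stands in for YAML null.
PROMPTS_FILES = {
    r"^02\..*\.md$": "Revise this paragraph from the manuscript titled {title}.",
    r"^ignore_this_file\.md$": None,
    r"^myfile\.md$": "Keywords: {keywords}. Text: {paragraph_text}",
}


def resolve_prompt(filename, values):
    """Return the filled-in prompt for the first regex matching `filename`.

    Returns None when the file is explicitly skipped or when nothing matches
    (an assumption made to keep this sketch short).
    """
    for pattern, template in PROMPTS_FILES.items():
        if re.search(pattern, filename):
            if template is None:
                return None  # file is skipped
            # Assumed substitution mechanism: plain str.format on the template.
            return template.format(**values)
    return None


values = {
    "title": "My Manuscript",
    "keywords": "genetics, networks",
    "paragraph_text": "Some text.",
}
print(resolve_prompt("02.methods.md", values))
print(resolve_prompt("ignore_this_file.md", values))
```

One caveat of this sketch: with `str.format`, a prompt that contains literal braces would need them doubled (`{{` and `}}`), and a placeholder missing from `values` raises `KeyError`. Whether the actual tool behaves this way is not specified here.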
120 | 121 | The following placeholders are available: 122 | - `{title}`: the title of the manuscript, as defined in the metadata 123 | - `{keywords}`: comma-delimited keywords from the manuscript metadata 124 | - `{paragraph_text}`: the text from the current paragraph 125 | - `{section_name}`: the name of the section (which is one of the following values "abstract", "introduction", "results", "discussion", "conclusions", "methods" or "supplementary material"), derived from the filename. 126 | 127 | The `section_name` placeholder works like so: 128 | - if the env var `AI_EDITOR_FILENAME_SECTION_MAPPING` is specified, it will be interpreted as a dictionary mapping filenames to section names. 129 | If a key of the dictionary is included in the filename, the value will be used as the section name. 130 | Also the keys and values can be any string, not just one of the section names mentioned before. 131 | - If the dict mentioned above is unset or the filename doesn't match any of its keys, the filename will be matched against the following values: "introduction", "methods", "results", "discussion", "conclusions" or "supplementary". 132 | If the values are contained within the filename, the section name will be mapped to that value. "supplementary" is replaced with "supplementary material", but the others are used as is. 133 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc/04.12.results_giant.md: -------------------------------------------------------------------------------- 1 | ### Replication of gene associations using tissue-specific gene networks from GIANT 2 | 3 | We sought to systematically analyze discrepant scores to assess whether associations were replicated in other datasets besides GTEx. 4 | This is challenging and prone to bias because linear-only correlation coefficients are usually used in gene co-expression analyses. 
5 | We used 144 tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@pmcid:PMC4828725; @url:https://hb.flatironinstitute.org], where nodes represent genes and each edge a functional relationship weighted with a probability of interaction between two genes (see [Methods](#sec:giant)). 6 | Importantly, the version of GIANT used in this study did not include GTEx samples [@url:https://hb.flatironinstitute.org/data], making it an ideal case for replication. 7 | These networks were built from expression and different interaction measurements, including protein-interaction, transcription factor regulation, chemical/genetic perturbations and microRNA target profiles from the Molecular Signatures Database (MSigDB [@pmid:16199517]). 8 | We reasoned that highly-ranked gene pairs using three different coefficients in a single tissue (whole blood in GTEx, Figure @fig:upsetplot_coefs) that represented real patterns should often replicate in a corresponding tissue or related cell lineage using the multi-cell type functional interaction networks in GIANT. 9 | In addition to predicting a network with interactions for a pair of genes, the GIANT web application can also automatically detect a relevant tissue or cell type where genes are predicted to be specifically expressed (the approach uses a machine learning method introduced in [@doi:10.1101/gr.155697.113] and described in [Methods](#sec:giant)). 10 | For example, we obtained the networks in blood and the automatically-predicted cell type for gene pairs *RASSF2* - *CYTIP* (CCC high, Figure @fig:giant_gene_pairs a) and *MYOZ1* - *TNNI2* (Pearson high, Figure @fig:giant_gene_pairs b). 11 | In addition to the gene pair, the networks include other genes connected according to their probability of interaction (up to 15 additional genes are shown), which allows estimating whether genes are part of the same tissue-specific biological process. 
12 | Two large black nodes in each network's top-left and bottom-right corners represent our gene pairs. 13 | A green edge means a close-to-zero probability of interaction, whereas a red edge represents a strong predicted relationship between the two genes. 14 | In this example, genes *RASSF2* and *CYTIP* (Figure @fig:giant_gene_pairs a), with a high CCC value ($c=0.20$, above the 73rd percentile) and low Pearson and Spearman ($p=0.16$ and $s=0.11$, below the 38th and 17th percentiles, respectively), were both strongly connected to the blood network, with interaction scores of at least 0.63 and an average of 0.75 and 0.84, respectively (Supplementary Table @tbl:giant:weights). 15 | The autodetected cell type for this pair was leukocytes, and interaction scores were similar to the blood network (Supplementary Table @tbl:giant:weights). 16 | However, genes *MYOZ1* and *TNNI2*, with a very high Pearson value ($p=0.97$), moderate Spearman ($s=0.28$) and very low CCC ($c=0.03$), were predicted to belong to much less cohesive networks (Figure @fig:giant_gene_pairs b), with average interaction scores of 0.17 and 0.22 with the rest of the genes, respectively. 17 | Additionally, the autodetected cell type (skeletal muscle) is not related to blood or one of its cell lineages. 18 | These preliminary results suggested that CCC might be capturing blood-specific patterns missed by the other coefficients. 19 | 20 | ![ 21 | **Analysis of GIANT tissue-specific predicted networks for gene pairs prioritized by correlation coefficients.** 22 | **a-b)** Two gene pairs prioritized by correlation coefficients (from Figure @fig:upsetplot_coefs b) with their predicted networks in blood (left) and an automatically selected tissue/cell type (right) using the method described in [@doi:10.1101/gr.155697.113]. 23 | A node represents a gene and an edge the probability that two genes are part of the same biological process in a specific cell type.
24 | A maximum of 15 genes are shown for each network. 25 | The GIANT web application automatically determined a minimum interaction confidence (edges' weights) to be shown. 26 | These networks can be analyzed online using the following links: 27 | *RASSF2* - *CYTIP* [@url:https://hb.flatironinstitute.org/gene/9770+9595], 28 | *MYOZ1* - *TNNI2* [@url:https://hb.flatironinstitute.org/gene/58529+7136]. 29 | **c)** Summary of predicted tissue/cell type networks for gene pairs exclusively prioritized by CCC and Pearson. 30 | The first row combines all gene pairs where CCC is high and Pearson or Spearman are low. 31 | The second row combines all gene pairs where Pearson is high and CCC or Spearman are low. 32 | Bar plots (left) show the number of gene pairs for each predicted tissue/cell type. 33 | Box plots (right) show the average probability of interaction between genes in these predicted tissue-specific networks. 34 | Red indicates CCC-only tissues/cell types, blue are Pearson-only, and purple are shared. 35 | ](images/coefs_comp/giant_networks/top_gene_pairs-main.svg "GIANT network interaction on gene pairs"){#fig:giant_gene_pairs width="100%"} 36 | 37 | 38 | We next performed a systematic evaluation using the top 100 discrepant gene pairs between CCC and the other two coefficients. 39 | For each gene pair prioritized in GTEx (whole blood), we autodetected a relevant cell type using GIANT to assess whether genes were predicted to be specifically expressed in a blood-relevant cell lineage. 40 | For this, we used the top five most commonly autodetected cell types for each coefficient and assessed connectivity in the resulting networks (see [Methods](#sec:giant)). 41 | The top 5 predicted cell types for gene pairs highly ranked by CCC and not by the rest were all blood-specific (Figure @fig:giant_gene_pairs c, top left), including macrophage, leukocyte, natural killer cell, blood and mononuclear phagocyte. 
42 | The average probability of interaction between genes in these CCC-ranked networks was significantly higher than the other coefficients (Figure @fig:giant_gene_pairs c, top right), with all medians larger than 67% and first quartiles above 41% across predicted cell types. 43 | In contrast, most Pearson's gene pairs were predicted to be specific to tissues unrelated to blood (Figure @fig:giant_gene_pairs c, bottom left), with skeletal muscle being the most commonly predicted tissue. 44 | The interaction probabilities in these Pearson-ranked networks were also generally lower than in CCC, except for blood-specific gene pairs (Figure @fig:giant_gene_pairs c, bottom right). 45 | The associations exclusively detected by CCC in whole blood from GTEx were more strongly replicated in these independent networks that incorporated multiple data modalities. 46 | CCC-ranked gene pairs not only had high probabilities of belonging to the same biological process but were also predicted to be specifically expressed in blood cell lineages. 47 | Conversely, most Pearson-ranked gene pairs were not predicted to be blood-specific, and their interaction probabilities were relatively low. 48 | This lack of replication in GIANT suggests that top Pearson-ranked gene pairs in GTEx might be driven mainly by outliers, which is consistent with our earlier observations of outlier-driven associations (Figure @fig:upsetplot_coefs b). 49 | -------------------------------------------------------------------------------- /tests/manuscripts/phenoplier_full/content/02.introduction.md: -------------------------------------------------------------------------------- 1 | ## Introduction 2 | 3 | Genes work together in context-specific networks to carry out different functions [@pmid:19104045; @doi:10.1038/ng.3259]. 4 | Variations in these genes can change their functional role and, at a higher level, affect disease-relevant biological processes [@doi:10.1038/s41467-018-06022-6]. 
5 | In this context, determining how genes influence complex traits requires mechanistically understanding expression regulation across different cell types [@doi:10.1126/science.aaz1776; @doi:10.1038/s41586-020-2559-3; @doi:10.1038/s41576-019-0200-9], which in turn should lead to improved treatments [@doi:10.1038/ng.3314; @doi:10.1371/journal.pgen.1008489]. 6 | Previous studies have described different regulatory DNA elements [@doi:10.1038/nature11247; @doi:10.1038/nature14248; @doi:10.1038/nature12787; @doi:10.1038/s41586-020-03145-z; @doi:10.1038/s41586-020-2559-3] including genetic effects on gene expression across different tissues [@doi:10.1126/science.aaz1776]. 7 | Integrating functional genomics data and GWAS data [@doi:10.1038/s41588-018-0081-4; @doi:10.1016/j.ajhg.2018.04.002; @doi:10.1038/s41588-018-0081-4; @doi:10.1038/ncomms6890] has improved the identification of these transcriptional mechanisms that, when dysregulated, commonly result in tissue- and cell lineage-specific pathology [@pmid:20624743; @pmid:14707169; @doi:10.1073/pnas.0810772105]. 8 | 9 | 10 | Given the availability of gene expression data across several tissues [@doi:10.1038/nbt.3838; @doi:10.1038/s41467-018-03751-6; @doi:10.1126/science.aaz1776; @doi:10.1186/s13040-020-00216-9], an effective approach to identify these biological processes is the transcription-wide association study (TWAS), which integrates expression quantitative trait loci (eQTLs) data to provide a mechanistic interpretation for GWAS findings. 
11 | TWAS relies on testing whether perturbations in gene regulatory mechanisms mediate the association between genetic variants and human diseases [@doi:10.1371/journal.pgen.1009482; @doi:10.1038/ng.3506; @doi:10.1371/journal.pgen.1007889; @doi:10.1038/ng.3367], and these approaches have been highly successful not only in understanding disease etiology at the transcriptome level [@pmid:33931583; @doi:10.1101/2021.10.21.21265225; @pmid:31036433] but also in disease-risk prediction (polygenic scores) [@doi:10.1186/s13059-021-02591-w] and drug repurposing [@doi:10.1038/nn.4618] tasks. 12 | However, TWAS works at the individual gene level, which does not capture more complex interactions at the network level. 13 | 14 | 15 | These gene-gene interactions play a crucial role in current theories of the architecture of complex traits, such as the omnigenic model [@doi:10.1016/j.cell.2017.05.038], which suggests that methods need to incorporate this complexity to disentangle disease-relevant mechanisms. 16 | Widespread gene pleiotropy, for instance, reveals the highly interconnected nature of transcriptional networks [@doi:10.1038/s41588-019-0481-0; @doi:10.1038/ng.3570], where potentially all genes expressed in disease-relevant cell types have a non-zero effect on the trait [@doi:10.1016/j.cell.2017.05.038; @doi:10.1016/j.cell.2019.04.014]. 17 | One way to learn these gene-gene interactions is using the concept of gene module: a group of genes with similar expression profiles across different conditions [@pmid:22955619; @pmid:25344726; @doi:10.1038/ng.3259]. 18 | In this context, several unsupervised approaches have been proposed to infer these gene-gene connections by extracting gene modules from co-expression patterns [@pmid:9843981; @pmid:24662387; @pmid:16333293]. 
19 | Matrix factorization techniques like independent or principal component analysis (ICA/PCA) have shown superior performance in this task [@doi:10.1038/s41467-018-03424-4] since they capture local expression effects from a subset of samples and can handle modules overlap effectively. 20 | Therefore, integrating genetic studies with gene modules extracted using unsupervised learning could further improve our understanding of disease origin [@pmid:25344726] and progression [@pmid:18631455]. 21 | 22 | 23 | Here we propose PhenoPLIER, an omnigenic approach that provides a gene module perspective to genetic studies. 24 | The flexibility of our method allows integrating different data modalities into the same representation for a joint analysis. 25 | We show that this module perspective can infer how groups of functionally-related genes influence complex traits, detect shared and distinct transcriptomic properties among traits, and predict how pharmacological perturbations affect genes' activity to exert their effects. 26 | PhenoPLIER maps gene-trait associations and drug-induced transcriptional responses into a common latent representation. 27 | For this, we integrate thousands of gene-trait associations (using TWAS from PhenomeXcan [@doi:10.1126/sciadv.aba2083]) and transcriptional profiles of drugs (from LINCS L1000 [@doi:10.1016/j.cell.2017.10.049]) into a low-dimensional space learned from public gene expression data on tens of thousands of RNA-seq samples (recount2 [@doi:10.1016/j.cels.2019.04.003; @doi:10.1038/nbt.3838]). 28 | We use a latent representation defined by a matrix factorization approach [@doi:10.1038/s41592-019-0456-1; @doi:10.1016/j.cels.2019.04.003] that extracts gene modules with certain sparsity constraints and preferences for those that align with prior knowledge (pathways). 
29 | When mapping gene-trait associations to this reduced expression space, we observe that diseases are significantly associated with gene modules expressed in relevant cell types: such as hypothyroidism with T cells, corneal endothelial cells with keratometry measurements, hematological assays on specific blood cell types, plasma lipids with adipose tissue, and neuropsychiatric disorders with different brain cell types. 30 | Moreover, since PhenoPLIER can use models derived from large and heterogeneous RNA-seq datasets, we can also identify modules associated with cell types under specific stimuli or disease states. 31 | We observe that significant module-trait associations in PhenomeXcan (our discovery cohort) replicated in the Electronic Medical Records and Genomics (eMERGE) network phase III [@doi:10.1038/gim.2013.72; @doi:10.1101/2021.10.21.21265225] (our replication cohort). 32 | Furthermore, we perform a CRISPR screen to analyze lipid regulation in HepG2 cells. 33 | We observe more robust trait associations with modules than with individual genes, even when single genes known to be involved in lipid metabolism did not reach genome-wide significance. 34 | Compared to a single-gene approach, our module-based method also better predicts FDA-approved drug-disease links by capturing tissue-specific pathophysiological mechanisms linked with the mechanism of action of drugs (e.g., niacin with cardiovascular traits via a known immune mechanism). 35 | This improved drug-disease prediction suggests that modules may provide a better means to examine drug-disease relationships than individual genes. 36 | Finally, exploring the phenotype-module space reveals stable trait clusters associated with relevant tissues, including a complex branch involving lipids with cardiovascular, autoimmune, and neuropsychiatric disorders. 
37 | In summary, instead of considering single genes associated with different complex traits, PhenoPLIER incorporates groups of genes that act together to carry out different functions in specific cell types. 38 | This approach improves robustness in detecting and interpreting genetic associations, and here we show how it can prioritize alternative and potentially more promising candidate targets even when known single gene associations are not detected. 39 | The approach represents a conceptual shift in the interpretation of genetic studies. 40 | It has the potential to extract mechanistic insight from statistical associations to enhance the understanding of complex diseases and their therapeutic modalities. 41 | -------------------------------------------------------------------------------- /tests/manuscripts/ccc/06.discussion.md: -------------------------------------------------------------------------------- 1 | ## Discussion 2 | 3 | We introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear machine learning-based statistic. 4 | Applying CCC to GTEx v8 revealed that it was robust to outliers and detected linear relationships as well as complex and biologically meaningful patterns that standard coefficients missed. 5 | In particular, CCC alone detected gene pairs with complex nonlinear patterns from the sex chromosomes, highlighting the role that not-only-linear coefficients can play in capturing sex-specific differences. 6 | The ability to capture these nonlinear patterns, however, extends beyond sex differences: it provides a powerful approach to detect complex relationships where a subset of samples or conditions are explained by other factors (such as differences between health and disease).
7 | We found that top CCC-ranked gene pairs in whole blood from GTEx were replicated in independent tissue-specific networks trained from multiple data types and attributed to cell lineages from blood, even though CCC did not have access to any cell lineage-specific information. 8 | This suggests that CCC can disentangle intricate cell lineage-specific transcriptional patterns missed by linear-only coefficients. 9 | In addition to capturing nonlinear patterns, the CCC was more similar to Spearman than Pearson, highlighting their shared robustness to outliers. 10 | The CCC results were concordant with MIC, but much faster to compute and thus practical for large datasets. 11 | Another advantage over MIC is that CCC can also process categorical variables together with numerical values. 12 | CCC is conceptually easy to interpret and has a single parameter that controls the maximum complexity of the detected relationships while also balancing compute time. 13 | 14 | 15 | Datasets such as Anscombe or "Datasaurus" highlight the value of visualization instead of relying on simple data summaries. 16 | While visual analysis is helpful, for many datasets examining each possible relationship is infeasible, and this is where more sophisticated and robust correlation coefficients are necessary. 17 | Advanced yet interpretable coefficients like CCC can focus human interpretation on patterns that are more likely to reflect real biology. 18 | The complexity of these patterns might reflect heterogeneity in samples that mask clear relationships between variables. 19 | For example, genes *UTY* - *KDM6A* (from sex chromosomes), detected by CCC, have a strong linear relationship but only in a subset of samples (males), which was not captured by linear-only coefficients. 
20 | This example, in particular, highlights the importance of considering sex as a biological variable (SABV) [@doi:10.1038/509282a] to avoid overlooking important differences between men and women, for instance, in disease manifestations [@doi:10.1210/endrev/bnaa034; @doi:10.1038/s41593-021-00806-8]. 21 | More generally, a not-only-linear correlation coefficient like CCC could identify significant differences between variables (such as genes) that are explained by a third factor (beyond sex differences), that would be entirely missed by linear-only coefficients. 22 | 23 | 24 | It is well-known that biomedical research is biased towards a small fraction of human genes [@pmid:17620606; @pmid:17472739]. 25 | Some genes highlighted in CCC-ranked pairs (Figure @fig:upsetplot_coefs b), such as *SDS* (12q24) and *ZDHHC12* (9q34), were previously found to be the focus of fewer than expected publications [@pmid:30226837]. 26 | It is possible that the widespread use of linear coefficients may bias researchers away from genes with complex coexpression patterns. 27 | A beyond-linear gene co-expression analysis on large compendia might shed light on the function of understudied genes. 28 | For example, gene *KLHL21* (1p36) and *AC068580.6* (*ENSG00000235027*, in 11p15) have a high CCC value and are missed by the other coefficients. 29 | *KLHL21* was suggested as a potential therapeutic target for hepatocellular carcinoma [@pmid:27769251] and other cancers [@pmid:29574153; @pmid:35084622]. 30 | Its nonlinear correlation with *AC068580.6* might unveil other important players in cancer initiation or progression, potentially in subsets of samples with specific characteristics (as suggested in Figure @fig:upsetplot_coefs b). 31 | 32 | 33 | Not-only-linear correlation coefficients might also be helpful in the field of genetic studies. 
34 | In this context, genome-wide association studies (GWAS) have been successful in understanding the molecular basis of common diseases by estimating the association between genotype and phenotype [@doi:10.1016/j.ajhg.2017.06.005]. 35 | However, the estimated effect sizes of genes identified with GWAS are generally modest, and they explain only a fraction of the phenotype variance, hampering the clinical translation of these findings [@doi:10.1038/s41576-019-0127-1]. 36 | Recent theories, like the omnigenic model for complex traits [@pmid:28622505; @pmid:31051098], argue that these observations are explained by highly-interconnected gene regulatory networks, with some core genes having a more direct effect on the phenotype than others. 37 | Using this omnigenic perspective, we and others [@doi:10.1101/2021.07.05.450786; @doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.10.21.21265342] have shown that integrating gene co-expression networks in genetic studies could potentially identify core genes that are missed by linear-only models alone like GWAS. 38 | Our results suggest that building these networks with more advanced and efficient correlation coefficients could better estimate gene co-expression profiles and thus more accurately identify these core genes. 39 | Approaches like CCC could play a significant role in the precision medicine field by providing the computational tools to focus on more promising genes representing potentially better candidate drug targets. 40 | 41 | 42 | Our analyses have some limitations. 43 | We worked on a sample with the top variable genes to keep computation time feasible. 44 | Although CCC is much faster than MIC, Pearson and Spearman are still the most computationally efficient since they only rely on simple data statistics. 45 | Our results, however, reveal the advantages of using more advanced coefficients like CCC for detecting and studying more intricate molecular mechanisms that replicated in independent datasets. 
46 | The application of CCC on larger compendia, such as recount3 [@pmid:34844637] with thousands of heterogeneous samples across different conditions, can reveal other potentially meaningful gene interactions. 47 | The single parameter of CCC, $k_{\mathrm{max}}$, controls the maximum complexity of patterns found and also impacts the compute time. 48 | Our analysis suggested that $k_{\mathrm{max}}=10$ was sufficient to identify both linear and more complex patterns in gene expression. 49 | A more comprehensive analysis of optimal values for this parameter could provide insights to adjust it for different applications or data types. 50 | 51 | 52 | While linear and rank-based correlation coefficients are exceptionally fast to calculate, not all relevant patterns in biological datasets are linear. 53 | For example, patterns associated with sex as a biological variable are not apparent to the linear-only coefficients that we evaluated but are revealed by not-only-linear methods. 54 | Beyond sex differences, being able to use a method that inherently identifies patterns driven by other factors is likely to be desirable. 55 | Not-only-linear coefficients can also disentangle intricate yet relevant patterns from expression data alone that were replicated in models integrating different data modalities. 56 | CCC, in particular, is highly parallelizable, and we anticipate efficient GPU-based implementations that could make it even faster. 57 | The CCC is an efficient, next-generation correlation coefficient that is highly effective in transcriptome analyses and potentially useful in a broad range of other domains. 58 | --------------------------------------------------------------------------------