├── tests
├── utils
│ ├── __init__.py
│ ├── dir_union_fixtures
│ │ ├── original
│ │ │ └── test.txt
│ │ └── patched
│ │ │ ├── sub
│ │ │ └── third.txt
│ │ │ └── another.txt
│ ├── test_dir_union.py
│ └── dir_union.py
├── manuscripts
│ ├── phenoplier_full
│ │ ├── ci
│ │ │ └── .gitkeep
│ │ └── content
│ │ │ ├── 10.references.md
│ │ │ ├── 04.00.results.md
│ │ │ ├── 01.abstract.md
│ │ │ ├── 15.acknowledgements.md
│ │ │ ├── 00.front-matter.md
│ │ │ ├── 04.05.01.crispr.md
│ │ │ ├── metadata.yaml
│ │ │ └── 02.introduction.md
│ ├── ccc
│ │ ├── 04.00.results.md
│ │ ├── 15.acknowledgements.md
│ │ ├── 10.references.md
│ │ ├── 08.05.methods.data.md
│ │ ├── 08.20.methods.mic.md
│ │ ├── 01.abstract.md
│ │ ├── metadata.yaml
│ │ ├── 00.front-matter.md
│ │ ├── 08.15.methods.giant.md
│ │ ├── 08.01.methods.ccc.md
│ │ ├── 04.05.results_intro.md
│ │ ├── 02.introduction.md
│ │ ├── 04.10.results_comp.md
│ │ ├── 04.12.results_giant.md
│ │ └── 06.discussion.md
│ ├── phenoplier_full_only_first_para
│ │ ├── ci
│ │ │ └── .gitkeep
│ │ └── content
│ │ │ ├── 10.references.md
│ │ │ ├── 04.00.results.md
│ │ │ ├── 07.00.methods.md
│ │ │ ├── 04.05.01.crispr.md
│ │ │ ├── 50.00.supplementary_material.md
│ │ │ ├── 15.acknowledgements.md
│ │ │ ├── 05.discussion.md
│ │ │ ├── 04.05.00.results_framework.md
│ │ │ ├── 02.introduction.md
│ │ │ ├── 01.abstract.md
│ │ │ ├── 04.15.drug_disease_prediction.md
│ │ │ ├── 04.20.00.traits_clustering.md
│ │ │ ├── 00.front-matter.md
│ │ │ └── metadata.yaml
│ ├── gbk_encoded
│ │ ├── 01.abstract.md
│ │ └── metadata.yaml
│ ├── mutator-epistasis
│ │ ├── 90.back-matter.md
│ │ ├── 06.acknowledgments.md
│ │ ├── manual-references.json
│ │ ├── metadata.yaml
│ │ ├── 01.abstract.md
│ │ ├── 00.front-matter.md
│ │ └── 02.introduction.md
│ ├── custom
│ │ ├── metadata.yaml
│ │ ├── 00.results_image_with_no_caption.md
│ │ └── 00.results_table_below_nonended_paragraph.md
│ ├── ccc_non_standard_filenames
│ │ ├── 01.ab.md
│ │ ├── metadata.yaml
│ │ ├── 00.front-matter.md
│ │ ├── 08.01.meths.md
│ │ ├── 04.05.res.md
│ │ └── 02.beginning.md
│ └── phenoplier
│ │ ├── 50.00.supplementary_material.md
│ │ ├── 50.01.supplementary_material.md
│ │ └── metadata.yaml
├── config_loader_fixtures
│ ├── single_generic_prompt
│ │ ├── ai-revision-prompts.yaml
│ │ └── ai-revision-config.yaml
│ ├── phenoplier_full
│ │ ├── ai-revision-prompts.yaml
│ │ └── ai-revision-config.yaml
│ ├── both_prompts_config
│ │ ├── ai-revision-config.yaml
│ │ └── ai-revision-prompts.yaml
│ ├── conflicting_promptsfiles_matchings
│ │ ├── ai-revision-config.yaml
│ │ └── ai-revision-prompts.yaml
│ ├── prompt_propogation
│ │ ├── ai-revision-prompts.yaml
│ │ └── ai-revision-config.yaml
│ ├── prompt_gpt3_e2e
│ │ ├── ai-revision-config.yaml
│ │ └── ai-revision-prompts.yaml
│ └── only_revision_prompts
│ │ └── ai-revision-prompts.yaml
├── provider_fixtures
│ ├── provider_model_engines.json
│ └── refresh_model_engines.py
├── conftest.py
└── test_model_providers.py
├── libs
└── manubot_ai_editor
│ ├── __init__.py
│ ├── exceptions.py
│ ├── utils.py
│ └── env_vars.py
├── environment.yml
├── .github
├── release-drafter.yml
└── workflows
│ ├── draft-release.yml
│ ├── publish-pypi.yml
│ └── run-tests.yml
├── pyproject.toml
├── LICENSE.md
├── .pre-commit-config.yaml
├── CITATION.cff
├── .gitignore
├── docs
├── env-vars.md
└── custom-prompts.md
└── CODE_OF_CONDUCT.md
/tests/utils/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full/ci/.gitkeep:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/04.00.results.md:
--------------------------------------------------------------------------------
1 | ## Results
2 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/ci/.gitkeep:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/tests/utils/dir_union_fixtures/original/test.txt:
--------------------------------------------------------------------------------
1 | hello, world!
--------------------------------------------------------------------------------
/tests/utils/dir_union_fixtures/patched/sub/third.txt:
--------------------------------------------------------------------------------
1 | a third file
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/15.acknowledgements.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/tests/utils/dir_union_fixtures/patched/another.txt:
--------------------------------------------------------------------------------
1 | patched in via unify mock
--------------------------------------------------------------------------------
/tests/config_loader_fixtures/single_generic_prompt/ai-revision-prompts.yaml:
--------------------------------------------------------------------------------
1 | prompts_files:
2 | \.md$: |
3 | Proofread the following paragraph
4 |
--------------------------------------------------------------------------------
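The `prompts_files` fixture above maps a filename pattern directly to prompt text. A minimal sketch of how such a mapping could be applied, assuming the keys are regular expressions tested with `re.search` against the file name (the function name `prompt_for` and the in-memory dict are hypothetical, not taken from the package):

```python
import re

# Hypothetical in-memory form of the fixture: regex pattern -> prompt text.
prompts_files = {r"\.md$": "Proofread the following paragraph\n"}


def prompt_for(filename, prompts_files):
    """Return the prompt of the first pattern found in filename, else None."""
    for pattern, prompt in prompts_files.items():
        if re.search(pattern, filename):
            return prompt
    return None
```

Under these assumptions, any Markdown file receives the generic proofreading prompt, and files such as `metadata.yaml` receive none.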
/tests/manuscripts/gbk_encoded/01.abstract.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/manubot/manubot-ai-editor/HEAD/tests/manuscripts/gbk_encoded/01.abstract.md
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/10.references.md:
--------------------------------------------------------------------------------
1 | ## References {.page_break_before}
2 |
3 |
4 |
5 |
--------------------------------------------------------------------------------
/tests/manuscripts/mutator-epistasis/90.back-matter.md:
--------------------------------------------------------------------------------
1 | ## References {.page_break_before}
2 |
3 |
4 |
5 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full/content/10.references.md:
--------------------------------------------------------------------------------
1 | ## References {.page_break_before}
2 |
3 |
4 |
5 |
--------------------------------------------------------------------------------
/tests/config_loader_fixtures/single_generic_prompt/ai-revision-config.yaml:
--------------------------------------------------------------------------------
1 | files:
2 | ignore:
3 | - front\-matter
4 | - back\-matter
5 | - response\-to\-reviewers
6 |
--------------------------------------------------------------------------------
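The `ignore` entries above read as escaped regular expressions. A small sketch of an ignore check, assuming each entry is searched anywhere in the file name (`is_ignored` is a hypothetical helper written for illustration):

```python
import re

# Patterns copied from the fixture; the escaped hyphen matches a literal "-".
ignore_patterns = [r"front\-matter", r"back\-matter", r"response\-to\-reviewers"]


def is_ignored(filename, patterns):
    """Return True if any ignore pattern occurs in the file name."""
    return any(re.search(pattern, filename) for pattern in patterns)
```

With this reading, `00.front-matter.md` and `90.back-matter.md` would be skipped, while `01.abstract.md` would still be revised.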
/tests/manuscripts/phenoplier_full_only_first_para/content/10.references.md:
--------------------------------------------------------------------------------
1 | ## References {.page_break_before}
2 |
3 |
4 |
5 |
--------------------------------------------------------------------------------
/libs/manubot_ai_editor/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 | Init file for manubot_ai_editor
3 | """
4 |
5 | # note: version data is maintained by poetry-dynamic-versioning (do not edit)
6 | __version__ = "0.0.0"
7 |
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: manubot-ai-editor
2 | channels:
3 | - conda-forge
4 | - defaults
5 | dependencies:
6 | - openai==0.28
7 | - pip
8 | - pytest=7.*
9 | - python=3.10.*
10 | - pyyaml=6.*
11 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full/content/04.00.results.md:
--------------------------------------------------------------------------------
1 | ## Results
2 |
3 |
11 |
--------------------------------------------------------------------------------
/tests/manuscripts/gbk_encoded/metadata.yaml:
--------------------------------------------------------------------------------
1 | ---
2 | title: "A Chinese hello, world; 你好,世界"
3 | keywords:
4 | - encoding
5 | lang: zh-CN
6 | authors:
7 | - name: 姓名
8 | initials: 姓名
9 | orcid: 0000-0000-0000-0000
10 | twitter: 姓名
11 | email: 姓名@姓名.edu
12 |
--------------------------------------------------------------------------------
/tests/manuscripts/mutator-epistasis/06.acknowledgments.md:
--------------------------------------------------------------------------------
1 | ## Acknowledgments
2 |
3 | We thank Robert W. Williams (University of Tennessee Health Sciences Center) and Don F. Conrad (Oregon Health & Science University) for very helpful comments and feedback on a draft of this manuscript.
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/04.00.results.md:
--------------------------------------------------------------------------------
1 | ## Results
2 |
3 |
11 |
--------------------------------------------------------------------------------
/libs/manubot_ai_editor/exceptions.py:
--------------------------------------------------------------------------------
1 | """
2 | Exception classes that are shared across modules in the project.
3 | """
4 |
5 |
6 | class APIKeyInvalidError(Exception):
7 | """
8 | Raised when a provider request is attempted with an invalid API key.
9 | """
10 |
11 | pass
12 |
--------------------------------------------------------------------------------
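A brief usage sketch for the exception above. The class is redefined so the snippet is self-contained, and `check_api_key` and `safe_check` are hypothetical helpers: the listing does not show where the package actually raises this error.

```python
class APIKeyInvalidError(Exception):
    """Raised when a provider request is attempted with an invalid API key."""


def check_api_key(api_key):
    # Hypothetical validation: treat a missing or blank key as invalid.
    if not api_key or not api_key.strip():
        raise APIKeyInvalidError("API key is missing or empty")
    return api_key.strip()


def safe_check(api_key):
    """Return True if the key passes the check, False otherwise."""
    try:
        check_api_key(api_key)
        return True
    except APIKeyInvalidError:
        return False
```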
/tests/config_loader_fixtures/phenoplier_full/ai-revision-prompts.yaml:
--------------------------------------------------------------------------------
1 | prompts:
2 | abstract: |
3 | Test match abstract.
4 |
5 | introduction_discussion: |
6 | Test match introduction or discussion.
7 |
8 | results: |
9 | Test match results.
10 |
11 | methods: |
12 | Test match methods.
13 |
14 | my_default_prompt: |
15 | default prompt text
16 |
--------------------------------------------------------------------------------
/tests/manuscripts/mutator-epistasis/manual-references.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "id": "url:https://github.com/manubot/rootstock",
4 | "type": "webpage",
5 | "URL": "https://github.com/manubot/rootstock",
6 | "title": "manubot/rootstock GitHub repository",
7 | "container-title": "GitHub",
8 | "issued": {
9 | "date-parts": [
10 | [
11 | 2019
12 | ]
13 | ]
14 | },
15 | "author": [
16 | {
17 | "given": "Daniel",
18 | "family": "Himmelstein"
19 | }
20 | ]
21 | }
22 | ]
23 |
--------------------------------------------------------------------------------
/.github/release-drafter.yml:
--------------------------------------------------------------------------------
1 | ---
2 | # template configuration for release-drafter
3 | # see: https://github.com/release-drafter/release-drafter
4 | name-template: 'v$RESOLVED_VERSION'
5 | tag-template: 'v$RESOLVED_VERSION'
6 | version-resolver:
7 | major:
8 | labels:
9 | - 'release-major'
10 | minor:
11 | labels:
12 | - 'release-minor'
13 | patch:
14 | labels:
15 | - 'release-patch'
16 | default: patch
17 | change-template: '- $TITLE (@$AUTHOR via #$NUMBER)'
18 | template: |
19 | ## Changes
20 |
21 | $CHANGES
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/08.05.methods.data.md:
--------------------------------------------------------------------------------
1 | ### Gene expression data and preprocessing {#sec:data_gtex}
2 |
3 | We downloaded GTEx v8 data for all tissues, normalized using TPM (transcripts per million), and focused our primary analysis on whole blood, which has a good sample size (755).
4 | We selected the top 5,000 genes from whole blood with the largest variance after standardizing with $log(x + 1)$ to avoid a bias towards highly-expressed genes.
5 | We then computed Pearson, Spearman, MIC and CCC on these 5,000 genes across all 755 samples on the TPM-normalized data, generating a pairwise similarity matrix of size 5,000 x 5,000.
6 |
--------------------------------------------------------------------------------
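The gene selection step described in the fixture text above (rank genes by variance after a $log(x + 1)$ transform, keep the top ones) can be sketched as follows. This is an illustration only: it assumes the natural logarithm and the sample variance, and `top_variable_genes` and the dict-of-lists input are hypothetical.

```python
import math
from statistics import variance


def top_variable_genes(expression, k):
    """Return the k genes with the largest variance of log(x + 1) values.

    expression: dict mapping gene name -> list of TPM values across samples.
    """
    scores = {
        gene: variance([math.log1p(x) for x in values])
        for gene, values in expression.items()
    }
    # Sort gene names by descending variance and keep the first k.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In the text, `k` would be 5,000 and each gene would have 755 values, one per whole blood sample.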
/tests/config_loader_fixtures/both_prompts_config/ai-revision-config.yaml:
--------------------------------------------------------------------------------
1 | files:
2 | matchings:
3 | - files:
4 | - abstract
5 | prompt: abstract
6 | - files:
7 | - introduction
8 | prompt: introduction_discussion
9 | - files:
10 | - 04\..+\.md
11 | prompt: results
12 | - files:
13 | - discussion
14 | prompt: introduction_discussion
15 | - files:
16 | - methods
17 | prompt: methods
18 |
19 | default_prompt: default
20 |
21 | ignore:
22 | - front-matter
23 | - acknowledgements
24 | - supplementary_material
25 | - references
26 |
--------------------------------------------------------------------------------
/tests/config_loader_fixtures/phenoplier_full/ai-revision-config.yaml:
--------------------------------------------------------------------------------
1 | files:
2 | matchings:
3 | - files:
4 | - abstract
5 | prompt: abstract
6 | - files:
7 | - introduction
8 | prompt: introduction_discussion
9 | - files:
10 | - 04\..+\.md
11 | prompt: results
12 | - files:
13 | - discussion
14 | prompt: introduction_discussion
15 | - files:
16 | - methods
17 | prompt: methods
18 |
19 | default_prompt: my_default_prompt
20 |
21 | ignore:
22 | - front\-matter
23 | - acknowledgements
24 | - supplementary_material
25 | - references
--------------------------------------------------------------------------------
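Taken together, the `phenoplier_full` prompts and config fixtures suggest a three-step lookup. The sketch below assumes that ignore patterns are checked first, that matchings are then tried in order with `re.search`, and that `default_prompt` is the fallback. That order, the flattened dict layout, and the name `resolve_prompt` are assumptions, not confirmed by the listing.

```python
import re

# Values copied from the two phenoplier_full fixtures.
prompts = {
    "abstract": "Test match abstract.\n",
    "introduction_discussion": "Test match introduction or discussion.\n",
    "results": "Test match results.\n",
    "methods": "Test match methods.\n",
    "my_default_prompt": "default prompt text\n",
}
config = {
    "matchings": [
        {"files": ["abstract"], "prompt": "abstract"},
        {"files": ["introduction"], "prompt": "introduction_discussion"},
        {"files": [r"04\..+\.md"], "prompt": "results"},
        {"files": ["discussion"], "prompt": "introduction_discussion"},
        {"files": ["methods"], "prompt": "methods"},
    ],
    "default_prompt": "my_default_prompt",
    "ignore": [r"front\-matter", "acknowledgements", "supplementary_material", "references"],
}


def resolve_prompt(filename, config, prompts):
    """Return the prompt text for filename, or None if the file is ignored."""
    if any(re.search(p, filename) for p in config["ignore"]):
        return None
    for matching in config["matchings"]:
        if any(re.search(p, filename) for p in matching["files"]):
            return prompts[matching["prompt"]]
    # No matching applied: fall back to the named default prompt.
    return prompts[config["default_prompt"]]
```

Under these assumptions, `04.05.01.crispr.md` resolves to the results prompt through the `04\..+\.md` pattern, and `10.references.md` is skipped.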
/tests/config_loader_fixtures/conflicting_promptsfiles_matchings/ai-revision-config.yaml:
--------------------------------------------------------------------------------
1 | files:
2 | matchings:
3 | - files:
4 | - abstract
5 | prompt: abstract
6 | - files:
7 | - introduction
8 | prompt: introduction_discussion
9 | - files:
10 | - 04\..+\.md
11 | prompt: results
12 | - files:
13 | - discussion
14 | prompt: introduction_discussion
15 | - files:
16 | - methods
17 | prompt: methods
18 |
19 | default_prompt: default
20 |
21 | ignore:
22 | - front-matter
23 | - acknowledgements
24 | - supplementary_material
25 | - references
26 |
--------------------------------------------------------------------------------
/.github/workflows/draft-release.yml:
--------------------------------------------------------------------------------
1 | ---
2 | # workflow for drafting releases on GitHub
3 | # see: https://github.com/release-drafter/release-drafter
4 | name: release drafter
5 |
6 | on:
7 | push:
8 | branches:
9 | - main
10 |
11 | jobs:
12 | draft_release:
13 | permissions:
14 | # write permission is required to create a github release
15 | contents: write
16 | # write permission is required for autolabeler
17 | # otherwise, read permission is required at least
18 | pull-requests: write
19 | runs-on: ubuntu-latest
20 | steps:
21 | - uses: release-drafter/release-drafter@v6
22 | env:
23 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/08.20.methods.mic.md:
--------------------------------------------------------------------------------
1 | ### Maximal Information Coefficient (MIC) {#sec:methods:mic}
2 |
3 | We used the Python package `minepy` [@doi:10.1093/bioinformatics/bts707; @url:https://github.com/minepy/minepy] (version 1.2.5) to estimate the MIC coefficient.
4 | In GTEx v8 (whole blood), we used MICe (an improved implementation of the original MIC introduced in [@Reshef2016]) with the default parameters `alpha=0.6`, `c=15` and `estimator='mic_e'`.
5 | We used the `pairwise_distances` function from `scikit-learn` [@Sklearn2011] to parallelize the computation of MIC on GTEx.
6 | For our computational complexity analyses (see [Supplementary Material](#sec:time_test)), we ran the original MIC (using parameter `estimator='mic_approx'`) and MICe (`estimator='mic_e'`).
7 |
--------------------------------------------------------------------------------
/tests/manuscripts/custom/metadata.yaml:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Manuscript Title"
3 | date: null # Defaults to date generated, but can specify like '2022-10-31'.
4 | keywords:
5 | - markdown
6 | - publishing
7 | - manubot
8 | lang: en-US
9 | authors:
10 | - github: johndoe
11 | name: John Doe
12 | initials: JD
13 | orcid: XXXX-XXXX-XXXX-XXXX
14 | twitter: johndoe
15 | email: john.doe@something.com
16 | affiliations:
17 | - Department of Something, University of Whatever
18 | funders:
19 | - Grant XXXXXXXX
20 | - github: janeroe
21 | name: Jane Roe
22 | initials: JR
23 | orcid: XXXX-XXXX-XXXX-XXXX
24 | email: jane.roe@whatever.edu
25 | affiliations:
26 | - Department of Something, University of Whatever
27 | - Department of Whatever, University of Something
28 | corresponding: true
29 |
--------------------------------------------------------------------------------
/tests/config_loader_fixtures/prompt_propogation/ai-revision-prompts.yaml:
--------------------------------------------------------------------------------
1 | prompts:
2 | front_matter: This is the front-matter prompt
3 | abstract: This is the abstract prompt
4 | introduction: "This is the introduction prompt for the paper titled '{title}'."
5 | results: This is the results prompt
6 | results_framework: This is the results_framework prompt
7 | crispr: This is the crispr prompt
8 | drug_disease_prediction: This is the drug_disease_prediction prompt
9 | traits_clustering: This is the traits_clustering prompt
10 | discussion: This is the discussion prompt
11 | methods: This is the methods prompt
12 | references: This is the references prompt
13 | acknowledgements: This is the acknowledgements prompt
14 | supplementary_material: This is the supplementary_material prompt
15 |
16 | default: |
17 | This is the default prompt
18 |
--------------------------------------------------------------------------------
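The `introduction` prompt above contains a `{title}` placeholder. One plausible way to fill it, assuming Python `str.format` semantics and a title taken from the manuscript's `metadata.yaml` (the `fill_prompt` helper and the example title are hypothetical):

```python
prompt_template = "This is the introduction prompt for the paper titled '{title}'."

# Hypothetical metadata; a real run would presumably read it from metadata.yaml.
metadata = {"title": "Manuscript Title"}


def fill_prompt(template, metadata):
    """Substitute {placeholders} in the template with metadata values."""
    return template.format(**metadata)
```

With `str.format`, a placeholder that has no matching metadata key raises `KeyError`, so a caller may want to validate templates before use.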
/tests/manuscripts/phenoplier_full_only_first_para/content/07.00.methods.md:
--------------------------------------------------------------------------------
1 | ## Methods {#sec:methods}
2 |
3 | PhenoPLIER is a framework that combines different computational approaches to integrate gene-trait associations and drug-induced transcriptional responses with groups of functionally-related genes (referred to as gene modules or latent variables/LVs).
4 | Gene-trait associations are computed using the PrediXcan family of methods, whereas latent variables are inferred by the MultiPLIER models applied on large gene expression compendia.
5 | PhenoPLIER provides
6 | 1) a regression model to compute an LV-trait association,
7 | 2) a consensus clustering approach applied to the latent space to learn shared and distinct transcriptomic properties between traits, and
8 | 3) an interpretable, LV-based drug repurposing framework.
9 | We provide the details of these methods below.
10 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/04.05.01.crispr.md:
--------------------------------------------------------------------------------
1 | ### LVs link genes that alter lipid accumulation with relevant traits and tissues
2 |
3 | Our first experiment attempted to answer whether genes in a disease-relevant LV could represent potential therapeutic targets.
4 | For this, the first step was to obtain a set of genes strongly associated with a phenotype of interest.
5 | Therefore, we performed a fluorescence-based CRISPR-Cas9 screen in the HepG2 cell line and identified 462 genes associated with lipid regulation ([Methods](#sec:methods:crispr)).
6 | From these, we selected two high-confidence gene sets that either caused a decrease or increase of lipids:
7 | a lipids-decreasing gene-set with eight genes: *BLCAP*, *FBXW7*, *INSIG2*, *PCYT2*, *PTEN*, *SOX9*, *TCF7L2*, *UBE2J2*;
8 | and a lipids-increasing gene-set with six genes: *ACACA*, *DGAT2*, *HILPDA*, *MBTPS1*, *SCAP*, *SRPR* (Supplementary Data 2).
9 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/50.00.supplementary_material.md:
--------------------------------------------------------------------------------
1 | \clearpage
2 |
3 | ## Supplementary information {.page_break_before}
4 |
5 | ### Supplementary Note 1: mean type I error rates and calibration of LV-based regression model {#sm:reg:null_sim}
6 |
7 | We assessed our GLS model type I error rates (proportion of $p$-values below 0.05) and calibration using a null model of random traits and genotype data from 1000 Genomes Phase III.
8 | We selected 312 individuals with European ancestry, and then analyzed 1,000 traits drawn from a standard normal distribution $\mathcal{N}(0,1)$.
9 | We ran all the standard procedures for the TWAS approaches (S-PrediXcan and S-MultiXcan), including:
10 | 1) a standard GWAS using linear regression under an additive genetic model,
11 | 2) different GWAS processing steps, including harmonization and imputation procedures as defined in [@doi:10.1002/gepi.22346],
12 | 3) S-PrediXcan and S-MultiXcan analyses.
13 | Below we provide details for each of these steps.
14 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/15.acknowledgements.md:
--------------------------------------------------------------------------------
1 | ## Acknowledgements
2 |
3 | This study was funded by:
4 | the Gordon and Betty Moore Foundation (GBMF 4552 to C.S. Greene; GBMF 4560 to B.D. Sullivan),
5 | the National Human Genome Research Institute (R01 HG010067 to C.S. Greene, S.F.A. Grant and B.D. Sullivan; K99 HG011898 and R00 HG011898 to M. Pividori; U01 HG011181 to W. Wei),
6 | the National Cancer Institute (R01 CA237170 to C.S. Greene),
7 | the Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765 to C.S. Greene),
8 | the National Institute on Aging (R01AG069900 to W. Wei),
9 | the National Institute of General Medical Sciences (R01 GM139891 to W. Wei);
10 | the National Heart, Lung, and Blood Institute (R01 HL163854 to Q. Feng);
11 | the National Institute of Diabetes and Digestive and Kidney Diseases (DK126194 to B.F. Voight);
12 | the Daniel B. Burke Endowed Chair for Diabetes Research to S.F.A. Grant;
13 | the Robert L. McNeil Jr. Endowed Fellowship in Translational Medicine and Therapeutics to C. Skarke.
14 |
--------------------------------------------------------------------------------
/tests/manuscripts/custom/00.results_image_with_no_caption.md:
--------------------------------------------------------------------------------
1 | ## Results
2 |
3 | This is the revision of the first paragraph of the introduction of CCC.
4 | This is the revision of the first paragraph of the introduction of CCC.
5 | This is the revision of the first paragraph of the introduction of CCC.
6 | This is the revision of the first paragraph of the introduction of CCC.
7 | This is the revision of the first paragraph of the introduction of CCC.
8 | This is the revision of the first paragraph of the introduction of CCC.
9 | This is the revision of the first paragraph of the introduction of CCC.
10 | This is the revision of the first paragraph of the introduction of CCC:
11 |
12 | {width="100%"}
14 |
15 | The tool, again, significantly revised the text, producing a much better and more concise introductory paragraph.
16 | For example, the revised first sentence (on the right) incorporates the ideas of "large datasets" and the "opportunities/possibilities" for "scientific exploration" clearly and briefly.
17 |
--------------------------------------------------------------------------------
/tests/config_loader_fixtures/prompt_gpt3_e2e/ai-revision-config.yaml:
--------------------------------------------------------------------------------
1 | files:
2 | matchings:
3 | - files:
4 | - front-matter
5 | prompt: front_matter
6 | - files:
7 | - abstract
8 | prompt: abstract
9 | - files:
10 | - introduction
11 | prompt: introduction
12 | - files:
13 | - results_framework
14 | prompt: results_framework
15 | - files:
16 | - results
17 | prompt: results
18 | - files:
19 | - crispr
20 | prompt: crispr
21 | - files:
22 | - drug_disease_prediction
23 | prompt: drug_disease_prediction
24 | - files:
25 | - traits_clustering
26 | prompt: traits_clustering
27 | - files:
28 | - discussion
29 | prompt: discussion
30 | - files:
31 | - methods
32 | prompt: methods
33 | - files:
34 | - references
35 | prompt: references
36 | - files:
37 | - acknowledgements
38 | prompt: acknowledgements
39 | - files:
40 | - supplementary_material
41 | prompt: supplementary_material
42 |
43 | default_prompt: default
44 |
45 | ignore:
46 | - results
47 | - references
48 |
--------------------------------------------------------------------------------
/tests/config_loader_fixtures/prompt_propogation/ai-revision-config.yaml:
--------------------------------------------------------------------------------
1 | files:
2 | matchings:
3 | - files:
4 | - front-matter
5 | prompt: front_matter
6 | - files:
7 | - abstract
8 | prompt: abstract
9 | - files:
10 | - introduction
11 | prompt: introduction
12 | - files:
13 | - results_framework
14 | prompt: results_framework
15 | - files:
16 | - results
17 | prompt: results
18 | - files:
19 | - crispr
20 | prompt: crispr
21 | - files:
22 | - drug_disease_prediction
23 | prompt: drug_disease_prediction
24 | - files:
25 | - traits_clustering
26 | prompt: traits_clustering
27 | - files:
28 | - discussion
29 | prompt: discussion
30 | - files:
31 | - methods
32 | prompt: methods
33 | - files:
34 | - references
35 | prompt: references
36 | - files:
37 | - acknowledgements
38 | prompt: acknowledgements
39 | - files:
40 | - supplementary_material
41 | prompt: supplementary_material
42 |
43 | default_prompt: default
44 |
45 | ignore:
46 | - results
47 | - references
48 |
--------------------------------------------------------------------------------
/.github/workflows/publish-pypi.yml:
--------------------------------------------------------------------------------
1 | ---
2 | # used for publishing packages to pypi on release
3 | name: publish pypi release
4 |
5 | on:
6 | release:
7 | types:
8 | - published
9 |
10 | jobs:
11 | publish_pypi:
12 | runs-on: ubuntu-latest
13 | environment: release
14 | permissions:
15 | # IMPORTANT: this permission is mandatory for trusted publishing
16 | id-token: write
17 | steps:
18 | - name: Checkout
19 | uses: actions/checkout@v4
20 | with:
21 | fetch-depth: 0
22 | - name: Fetch tags
23 | run: git fetch --all --tags
24 | - name: Python setup
25 | uses: actions/setup-python@v5
26 | with:
27 | python-version: "3.11"
28 | - name: Setup for poetry
29 | run: |
30 | python -m pip install poetry
31 | poetry self add "poetry-dynamic-versioning[plugin]"
32 | - name: Install environment
33 | run: poetry install --no-interaction --no-ansi
34 | - name: poetry build distribution content
35 | run: poetry build
36 | - name: Publish package distributions to PyPI
37 | uses: pypa/gh-action-pypi-publish@release/v1
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/05.discussion.md:
--------------------------------------------------------------------------------
1 | ## Discussion
2 |
3 | We have introduced a novel computational strategy that integrates statistical associations from TWAS with groups of genes (gene modules) that have similar expression patterns across the same cell types.
4 | Our key innovation is that we project gene-trait associations through a latent representation derived not strictly from measures of normal tissue but also from cell types under a variety of stimuli and at various developmental stages.
5 | This improves interpretation by going beyond statistical associations to infer cell type-specific features of complex phenotypes.
6 | Our approach can identify disease-relevant cell types from summary statistics, and several disease-associated gene modules were replicated in eMERGE.
7 | Using a CRISPR screen to analyze lipid regulation, we found that our gene module-based approach can prioritize causal genes even when single gene associations are not detected.
8 | We interpret these findings with an omnigenic perspective of "core" and "peripheral" genes, suggesting that the approach can identify genes that directly affect the trait with no mediated regulation of other genes and thus prioritize alternative and potentially more attractive therapeutic targets.
9 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/04.05.00.results_framework.md:
--------------------------------------------------------------------------------
1 | ### PhenoPLIER: an integration framework based on gene co-expression patterns
2 |
3 | PhenoPLIER is a flexible computational framework that combines gene-trait and gene-drug associations with gene modules expressed in specific contexts (Figure {@fig:entire_process}a).
4 | The approach uses a latent representation (with latent variables or LVs representing gene modules) derived from a large gene expression compendium (Figure {@fig:entire_process}b, top) to integrate TWAS with drug-induced transcriptional responses (Figure {@fig:entire_process}b, middle) for a joint analysis.
5 | The approach consists in three main components (Figure {@fig:entire_process}b, bottom, see [Methods](#sec:methods)):
6 | 1) an LV-based regression model to compute an association between an LV and a trait,
7 | 2) a clustering framework to learn groups of traits with shared transcriptomic properties,
8 | and 3) an LV-based drug repurposing approach that links diseases to potential treatments.
9 | We performed extensive simulations for our regression model ([Supplementary Note 1](#sm:reg:null_sim)) and clustering framework ([Supplementary Note 2](#sm:clustering:null_sim)) to ensure proper calibration and expected results under a model of no association.
10 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/02.introduction.md:
--------------------------------------------------------------------------------
1 | ## Introduction
2 |
3 | Genes work together in context-specific networks to carry out different functions [@pmid:19104045; @doi:10.1038/ng.3259].
4 | Variations in these genes can change their functional role and, at a higher level, affect disease-relevant biological processes [@doi:10.1038/s41467-018-06022-6].
5 | In this context, determining how genes influence complex traits requires mechanistically understanding expression regulation across different cell types [@doi:10.1126/science.aaz1776; @doi:10.1038/s41586-020-2559-3; @doi:10.1038/s41576-019-0200-9], which in turn should lead to improved treatments [@doi:10.1038/ng.3314; @doi:10.1371/journal.pgen.1008489].
6 | Previous studies have described different regulatory DNA elements [@doi:10.1038/nature11247; @doi:10.1038/nature14248; @doi:10.1038/nature12787; @doi:10.1038/s41586-020-03145-z; @doi:10.1038/s41586-020-2559-3] including genetic effects on gene expression across different tissues [@doi:10.1126/science.aaz1776].
7 | Integrating functional genomics data and GWAS data [@doi:10.1038/s41588-018-0081-4; @doi:10.1016/j.ajhg.2018.04.002; @doi:10.1038/s41588-018-0081-4; @doi:10.1038/ncomms6890] has improved the identification of these transcriptional mechanisms that, when dysregulated, commonly result in tissue- and cell lineage-specific pathology [@pmid:20624743; @pmid:14707169; @doi:10.1073/pnas.0810772105].
8 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/01.abstract.md:
--------------------------------------------------------------------------------
1 | ## Abstract {.page_break_before}
2 |
3 | Correlation coefficients are widely used to identify patterns in data that may be of particular interest.
4 | In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes.
5 | Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models.
6 | CCC reveals biologically meaningful linear and nonlinear patterns missed by standard, linear-only correlation coefficients.
7 | CCC captures general patterns in data by comparing clustering solutions while being much faster than state-of-the-art coefficients such as the Maximal Information Coefficient.
8 | When applied to human gene expression data, CCC identifies robust linear relationships while detecting nonlinear patterns associated, for example, with sex differences that are not captured by linear-only coefficients.
9 | Gene pairs highly ranked by CCC were enriched for interactions in integrated networks built from protein-protein interaction, transcription factor regulation, and chemical and genetic perturbations, suggesting that CCC could detect functional relationships that linear-only methods missed.
10 | CCC is a highly-efficient, next-generation not-only-linear correlation coefficient that can readily be applied to genome-scale data and other domains across different data types.
11 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc_non_standard_filenames/01.ab.md:
--------------------------------------------------------------------------------
1 | ## Abstract {.page_break_before}
2 |
3 | Correlation coefficients are widely used to identify patterns in data that may be of particular interest.
4 | In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes.
5 | Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models.
6 | CCC reveals biologically meaningful linear and nonlinear patterns missed by standard, linear-only correlation coefficients.
7 | CCC captures general patterns in data by comparing clustering solutions while being much faster than state-of-the-art coefficients such as the Maximal Information Coefficient.
8 | When applied to human gene expression data, CCC identifies robust linear relationships while detecting nonlinear patterns associated, for example, with sex differences that are not captured by linear-only coefficients.
9 | Gene pairs highly ranked by CCC were enriched for interactions in integrated networks built from protein-protein interaction, transcription factor regulation, and chemical and genetic perturbations, suggesting that CCC could detect functional relationships that linear-only methods missed.
10 | CCC is a highly-efficient, next-generation not-only-linear correlation coefficient that can readily be applied to genome-scale data and other domains across different data types.
11 |
--------------------------------------------------------------------------------
/tests/manuscripts/custom/00.results_table_below_nonended_paragraph.md:
--------------------------------------------------------------------------------
1 | ## Results
2 |
3 | This is the revision of the first paragraph of the introduction of CCC.
4 | This is the revision of the first paragraph of the introduction of CCC.
5 | This is the revision of the first paragraph of the introduction of CCC.
6 | This is the revision of the first paragraph of the introduction of CCC.
7 | This is the revision of the first paragraph of the introduction of CCC.
8 | This is the revision of the first paragraph of the introduction of CCC.
9 | This is the revision of the first paragraph of the introduction of CCC.
10 | This is the revision of the first paragraph of the introduction of CCC:
11 |
12 | | Pathway | AUC | FDR |
13 | |:------------------------------------|:------|:---------|
14 | | IRIS Neutrophil-Resting | 0.91 | 4.51e-35 |
15 | | SVM Neutrophils | 0.98 | 1.43e-09 |
16 | | PID IL8CXCR2 PATHWAY | 0.81 | 7.04e-03 |
17 | | SIG PIP3 SIGNALING IN B LYMPHOCYTES | 0.77 | 1.95e-02 |
18 |
19 | Table: Pathways aligned to LV603 from the MultiPLIER models. {#tbl:sup:multiplier_pathways:lv603}
20 |
21 | The tool, again, significantly revised the text, producing a much better and more concise introductory paragraph.
22 | For example, the revised first sentence (on the right) incorporates the ideas of "large datasets" and the "opportunities/possibilities" for "scientific exploration" clearly and briefly.
23 |
--------------------------------------------------------------------------------
/.github/workflows/run-tests.yml:
--------------------------------------------------------------------------------
1 | ---
2 | name: run tests
3 |
4 | on:
5 | push:
6 | branches: [main]
7 | pull_request:
8 | branches: [main]
9 |
10 | jobs:
11 | pre_commit_checks:
12 | runs-on: ubuntu-24.04
13 | steps:
14 | - uses: actions/checkout@v4
15 | - uses: actions/setup-python@v5
16 | with:
17 | python-version: "3.10"
18 | - uses: pre-commit/action@v3.0.1
19 | id: pre_commit
20 | # run pre-commit ci lite for automated fixes
21 | - uses: pre-commit-ci/lite-action@v1.1.0
22 | if: ${{ !cancelled() && steps.pre_commit.outcome == 'failure' }}
23 | tests:
24 | strategy:
25 | matrix:
26 | # matrixed execution for parallel gh-action performance increases
27 | python_version: ["3.10", "3.11", "3.12", "3.13"]
28 | os: [ubuntu-24.04, macos-14]
29 | runs-on: ${{ matrix.os }}
30 | env:
31 | OS: ${{ matrix.os }}
32 | steps:
33 | - name: Checkout
34 | uses: actions/checkout@v4
35 | - name: Python setup
36 | uses: actions/setup-python@v5
37 | with:
38 | python-version: ${{ matrix.python_version }}
39 | - name: Setup poetry
40 | run: |
41 | pip install poetry
42 | - name: Install poetry env
43 | run: |
44 | poetry install
45 | - name: Run pytest
46 | env:
47 | # set placeholder API key, required by tests
48 | PROVIDER_API_KEY: ABCD1234
49 | run: poetry run pytest
50 |
--------------------------------------------------------------------------------
/tests/manuscripts/mutator-epistasis/metadata.yaml:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Epistasis between mutator alleles contributes to germline mutation rate variability in laboratory mice"
3 | date: null # Defaults to date generated, but can specify like '2022-10-31'.
4 | keywords:
5 | - markdown
6 | - publishing
7 | - manubot
8 | lang: en-US
9 | authors:
10 | - name: Thomas A. Sasani
11 | github: tomsasani
12 | initials: TAS
13 | orcid: 0000-0003-2317-1374
14 | twitter: tomsasani
15 | email: thomas.a.sasani@gmail.com
16 | affiliations:
17 | - Department of Human Genetics, University of Utah
18 | - name: Aaron R. Quinlan
19 | initials: ARQ
20 | orcid: 0000-0003-1756-0859
21 | twitter: aaronquinlan
22 | email: aquinlan@genetics.utah.edu
23 | affiliations:
24 | - Department of Human Genetics, University of Utah
25 | - Department of Biomedical Informatics, University of Utah
26 | funders:
27 | - NIH/NHGRI R01HG012252
28 | corresponding: true
29 | - name: Kelley Harris
30 | initials: KH
31 | orcid: 0000-0003-0302-2523
32 | twitter: Kelley__Harris
33 | email: harriske@uw.edu
34 | affiliations:
35 | - Department of Genome Sciences, University of Washington
36 | funders:
37 | - NIH/NIGMS R35GM133428
38 | - Burroughs Wellcome Career Award at the Scientific Interface
39 | - Searle Scholarship
40 | - Pew Scholarship
41 | - Sloan Fellowship
42 | - Allen Discovery Center for Cell Lineage Tracing
43 | corresponding: true
44 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full/content/01.abstract.md:
--------------------------------------------------------------------------------
1 | ## Abstract {.page_break_before}
2 |
3 | Genes act in concert with each other in specific contexts to perform their functions.
4 | Determining how these genes influence complex traits requires a mechanistic understanding of expression regulation across different conditions.
5 | It has been shown that this insight is critical for developing new therapies.
6 | Transcriptome-wide association studies have helped uncover the role of individual genes in disease-relevant mechanisms.
7 | However, modern models of the architecture of complex traits predict that gene-gene interactions play a crucial role in disease origin and progression.
8 | Here we introduce PhenoPLIER, a computational approach that maps gene-trait associations and pharmacological perturbation data into a common latent representation for a joint analysis.
9 | This representation is based on modules of genes with similar expression patterns across the same conditions.
10 | We observe that diseases are significantly associated with gene modules expressed in relevant cell types, and our approach is accurate in predicting known drug-disease pairs and inferring mechanisms of action.
11 | Furthermore, using a CRISPR screen to analyze lipid regulation, we find that functionally important players lack associations but are prioritized in trait-associated modules by PhenoPLIER.
12 | By incorporating groups of co-expressed genes, PhenoPLIER can contextualize genetic associations and reveal potential targets missed by single-gene strategies.
13 |
--------------------------------------------------------------------------------
/tests/utils/test_dir_union.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | from unittest import mock
3 |
4 | from .dir_union import mock_unify_open, set_directory
5 |
6 | # tests for mock_unify_open
7 |
8 | UNIFY_TEST_DIR = Path(__file__).parent / "dir_union_fixtures"
9 | UNIFY_ORIG_DIR = UNIFY_TEST_DIR / "original"
10 | UNIFY_PATCHED_DIR = UNIFY_TEST_DIR / "patched"
11 |
12 |
13 | @mock.patch("builtins.open", mock_unify_open(UNIFY_ORIG_DIR, UNIFY_PATCHED_DIR))
14 | def test_unify_folder_mock():
15 | # test that we can still open files in the original folder
16 | with open(UNIFY_ORIG_DIR / "test.txt") as fp:
17 | assert fp.read().strip() == "hello, world!"
18 | # test that the patched folder takes precedence
19 | with open(UNIFY_ORIG_DIR / "another.txt") as fp:
20 | assert fp.read().strip() == "patched in via unify mock"
21 |
22 |
23 | @mock.patch("builtins.open", mock_unify_open(UNIFY_ORIG_DIR, UNIFY_PATCHED_DIR))
24 | def test_unify_folder_mock_relative_paths():
25 | with set_directory(UNIFY_ORIG_DIR):
26 | # test that we can still open files in the original folder
27 | with open("./test.txt") as fp:
28 | assert fp.read().strip() == "hello, world!"
29 | # test that the patched folder takes precedence
30 | with open("./another.txt") as fp:
31 | assert fp.read().strip() == "patched in via unify mock"
32 | # test that subfolders in the patched folder can be used
33 | with open("./sub/third.txt") as fp:
34 | assert fp.read().strip() == "a third file"
35 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/01.abstract.md:
--------------------------------------------------------------------------------
1 | ## Abstract {.page_break_before}
2 |
3 | Genes act in concert with each other in specific contexts to perform their functions.
4 | Determining how these genes influence complex traits requires a mechanistic understanding of expression regulation across different conditions.
5 | It has been shown that this insight is critical for developing new therapies.
6 | Transcriptome-wide association studies have helped uncover the role of individual genes in disease-relevant mechanisms.
7 | However, modern models of the architecture of complex traits predict that gene-gene interactions play a crucial role in disease origin and progression.
8 | Here we introduce PhenoPLIER, a computational approach that maps gene-trait associations and pharmacological perturbation data into a common latent representation for a joint analysis.
9 | This representation is based on modules of genes with similar expression patterns across the same conditions.
10 | We observe that diseases are significantly associated with gene modules expressed in relevant cell types, and our approach is accurate in predicting known drug-disease pairs and inferring mechanisms of action.
11 | Furthermore, using a CRISPR screen to analyze lipid regulation, we find that functionally important players lack associations but are prioritized in trait-associated modules by PhenoPLIER.
12 | By incorporating groups of co-expressed genes, PhenoPLIER can contextualize genetic associations and reveal potential targets missed by single-gene strategies.
13 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | build-backend = "poetry_dynamic_versioning.backend"
3 | requires = [ "poetry-core>=1", "poetry-dynamic-versioning>=1,<2" ]
4 |
5 | [tool.poetry]
6 | name = "manubot-ai-editor"
7 | # note: version data is maintained by poetry-dynamic-versioning (do not edit)
8 | version = "0.0.0"
9 | description = "A Manubot plugin to revise a manuscript using GPT-3"
10 | authors = [ "Milton Pividori" ]
11 | maintainers = [
12 | "Milton Pividori",
13 | "Faisal Alquaddoomi",
14 | "Vincent Rubinetti",
15 | "Dave Bunten",
16 | ]
17 | license = "BSD-3-Clause"
18 | readme = "README.md"
19 | repository = "https://github.com/manubot/manubot-ai-editor"
20 | homepage = "https://github.com/manubot/manubot-ai-editor"
21 | classifiers = [
22 | "Programming Language :: Python :: 3",
23 | "License :: OSI Approved :: BSD License",
24 | "Operating System :: OS Independent",
25 | ]
26 | packages = [ { include = "manubot_ai_editor", from = "libs" } ]
27 |
28 | [tool.poetry.dependencies]
29 | python = ">=3.10,<4.0"
30 | langchain-core = "^0.3.6"
31 | langchain-openai = "^0.2.0"
32 | langchain-anthropic = "^0.3.0"
33 | pyyaml = "*"
34 | charset-normalizer = "^3.4.1"
35 |
36 | [tool.poetry.group.dev.dependencies]
37 | pytest = ">=8.3.3"
38 | pytest-antilru = "^2.0.0"
39 |
40 | [tool.poetry.requires-plugins]
41 | poetry-dynamic-versioning = { version = ">=1.0.0,<2.0.0", extras = [ "plugin" ] }
42 |
43 | [tool.poetry-dynamic-versioning]
44 | enable = true
45 | style = "pep440"
46 | vcs = "git"
47 | substitution.files = [ "libs/manubot_ai_editor/__init__.py" ]
48 |
49 | [tool.setuptools_scm]
50 | root = "."
51 |
--------------------------------------------------------------------------------
/tests/utils/dir_union.py:
--------------------------------------------------------------------------------
1 | import os
2 | from pathlib import Path
3 |
4 | from contextlib import contextmanager
5 |
6 |
7 | @contextmanager
8 | def set_directory(new):
9 | """
10 | Given a path, sets it as the current working directory,
11 | then sets it back once the context has been exited.
12 |
13 | Note that if we upgrade to Python 3.11, this method can be replaced
14 | with https://docs.python.org/3/library/contextlib.html#contextlib.chdir
15 | """
16 |
17 | # store the current path so we can return to it
18 | original = Path().absolute()
19 |
20 | try:
21 | os.chdir(new)
22 | yield
23 | finally:
24 | os.chdir(original)
25 |
26 |
27 | def mock_unify_open(original, patched):
28 | """
29 | Given paths to an 'original' and 'patched' folder,
30 | patches open() to first check the patched folder for the
31 | target file, then checks the original folder if it's not found
32 | in the patched folder.
33 | """
34 | builtin_open = open
35 |
36 | def unify_open(*args, **kwargs):
37 | try:
38 | # first, try to open the file from within patched
39 |
40 | # resolve all paths: the original, patched, and requested file
41 | target_full_path = Path(args[0]).absolute()
42 | rewritten_path = str(target_full_path).replace(
43 | str(original.absolute()), str(patched.absolute())
44 | )
45 |
46 | return builtin_open(rewritten_path, *(args[1:]), **kwargs)
47 | except FileNotFoundError:
48 | # resort to opening it normally
49 | return builtin_open(*args, **kwargs)
50 |
51 | return unify_open
52 |
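The lookup order described in the `mock_unify_open` docstring (patched folder first, original folder as fallback) can be illustrated with temporary directories. This is a minimal sketch, not the repository's code: `make_union_open` and `demo` are hypothetical names, and the helper only mirrors the path-rewriting idea shown above.

```python
import tempfile
from pathlib import Path
from unittest import mock


def make_union_open(original: Path, patched: Path):
    """Hypothetical stand-in that mirrors the patched-then-original lookup order."""
    builtin_open = open  # captured before builtins.open is patched

    def union_open(file, *args, **kwargs):
        # rewrite the requested path so it points into the patched folder
        rewritten = str(Path(file).absolute()).replace(
            str(original.absolute()), str(patched.absolute())
        )
        try:
            return builtin_open(rewritten, *args, **kwargs)
        except FileNotFoundError:
            # not present in the patched folder: open the requested path as-is
            return builtin_open(file, *args, **kwargs)

    return union_open


def demo() -> dict:
    """Build two small folders and read files through the patched open()."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "original"
        patched = Path(tmp) / "patched"
        original.mkdir()
        patched.mkdir()
        (original / "only_original.txt").write_text("from original")
        (original / "both.txt").write_text("from original")
        (patched / "both.txt").write_text("from patched")

        results = {}
        with mock.patch("builtins.open", make_union_open(original, patched)):
            for name in ("only_original.txt", "both.txt"):
                with open(original / name) as fp:
                    results[name] = fp.read()
        return results
```

Calling `demo()` reads both files through paths in the original folder; the file that also exists in the patched folder is served from there, which is the precedence the tests above rely on.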
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | # BSD 3-Clause License
2 |
3 | Copyright (c) 2024, Contributors and the Pividori Lab at the University of Colorado Anschutz Medical Campus
4 |
5 | All rights reserved.
6 |
7 | Redistribution and use in source and binary forms, with or without
8 | modification, are permitted provided that the following conditions are met:
9 |
10 | 1. Redistributions of source code must retain the above copyright notice, this
11 | list of conditions and the following disclaimer.
12 |
13 | 2. Redistributions in binary form must reproduce the above copyright notice,
14 | this list of conditions and the following disclaimer in the documentation
15 | and/or other materials provided with the distribution.
16 |
17 | 3. Neither the name of the copyright holder nor the names of its
18 | contributors may be used to endorse or promote products derived from
19 | this software without specific prior written permission.
20 |
21 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
22 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
23 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
24 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
25 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
26 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
27 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
28 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
29 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
30 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
31 |
--------------------------------------------------------------------------------
/libs/manubot_ai_editor/utils.py:
--------------------------------------------------------------------------------
1 | import re
2 | import difflib
3 |
4 | import yaml
5 |
6 |
7 | SIMPLE_SENTENCE_END_PATTERN = re.compile(r"\.\s")
8 | SENTENCE_END_PATTERN = re.compile(r"\.\s(\S)")
9 |
10 |
11 | def get_yaml_field(yaml_file, field):
12 | """
13 | Returns the value of a field in a YAML file.
14 | """
15 | with open(yaml_file, "r") as f:
16 | data = yaml.safe_load(f)
17 | return data[field]
18 |
19 |
20 | def starts_with_similar(string: str, prefix: str, threshold: float = 0.8) -> bool:
21 | """
22 | Returns True if the string starts with a prefix that is similar to the given prefix.
23 | """
24 | return (
25 | difflib.SequenceMatcher(None, prefix, string[: len(prefix)]).ratio() > threshold
26 | )
27 |
28 |
29 | def get_obj_path(target: any, path: tuple, missing=None):
30 | """
31 | Traverse a nested object using a tuple of keys, returning the last resolved
32 | value in the path. If any key is not found, return 'missing' (default None).
33 |
34 | >>> get_obj_path({'a': {'b': {'c': 1}}}, ('a', 'b', 'c'))
35 | 1
36 | >>> get_obj_path({'a': {'b': {'c': 1}}}, ('a', 'b', 'd')) is None
37 | True
38 | >>> get_obj_path({'a': {'b': {'c': 1}}}, ('a', 'b', 'd'), missing=2)
39 | 2
40 | >>> get_obj_path({'a': [100, {'c': 1}]}, ('a', 1, 'c'))
41 | 1
42 | >>> get_obj_path({'a': [100, {'c': 1}]}, ('a', 1, 'd')) is None
43 | True
44 | >>> get_obj_path({'a': [100, {'c': 1}]}, ('a', 3)) is None
45 | True
46 | """
47 | try:
48 | for key in path:
49 | target = target[key]
50 | except (KeyError, IndexError, TypeError):
51 | return missing
52 |
53 | return target
54 |
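To make the `threshold` in `starts_with_similar` more concrete, the sketch below computes the same `difflib.SequenceMatcher` ratio on a few inputs. The helper name `prefix_similarity` and the sample strings are illustrative assumptions, not part of the module.

```python
import difflib


def prefix_similarity(string: str, prefix: str) -> float:
    """Ratio between `prefix` and the first len(prefix) characters of `string`."""
    return difflib.SequenceMatcher(None, prefix, string[: len(prefix)]).ratio()


# identical leading text, a single-character typo, and unrelated text
exact = prefix_similarity("Abstract: genes act in concert", "Abstract")
one_typo = prefix_similarity("Abstrakt: genes act in concert", "Abstract")
unrelated = prefix_similarity("zzzzzzzz", "Abstract")
```

With the default threshold of 0.8, the first two inputs would count as starting with a similar prefix and the third would not, so small deviations in the leading text are tolerated while unrelated text is rejected.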
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/04.15.drug_disease_prediction.md:
--------------------------------------------------------------------------------
1 | ### LVs predict drug-disease pairs better than single genes
2 |
3 | We next determined how substituting LVs for individual genes predicted known treatment-disease relationships.
4 | For this, we used the transcriptional responses to small molecule perturbations profiled in LINCS L1000 [@doi:10.1016/j.cell.2017.10.049], which were further processed and mapped to DrugBank IDs [@doi:10.1093/nar/gkt1068; @doi:10.7554/eLife.26726; @doi:10.5281/zenodo.47223].
5 | Based on an established drug repurposing strategy that matches reversed transcriptome patterns between genes and drug-induced perturbations [@doi:10.1126/scitranslmed.3002648; @doi:10.1126/scitranslmed.3001318], we adopted a previously described framework that uses imputed transcriptomes from TWAS to prioritize drug candidates [@doi:10.1038/nn.4618].
6 | For this, we computed a drug-disease score by calculating the negative dot product between the $z$-scores for a disease (from TWAS) and the $z$-scores for a drug (from LINCS) across sets of genes of different sizes (see [Methods](#sec:methods:drug)).
7 | Therefore, a large score for a drug-disease pair indicated that higher (lower) predicted expression values of disease-associated genes are down (up)-regulated by the drug, thus predicting a potential treatment.
8 | Similarly, for the LV-based approach, we estimated how pharmacological perturbations affected the gene module activity by projecting expression profiles of drugs into our latent representation (Figure {@fig:entire_process}b).
9 | We used a manually-curated gold standard set of drug-disease medical indications [@doi:10.7554/eLife.26726; @doi:10.5281/zenodo.47664] for 322 drugs across 53 diseases to evaluate the prediction performance.
10 |
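The scoring rule described above (a negative dot product between disease and drug $z$-scores) can be sketched as follows. This is a toy illustration under stated assumptions: the gene names and values are made up, and restricting the sum to genes present in both profiles is an assumption of this sketch rather than the procedure given in the Methods.

```python
def drug_disease_score(disease_z: dict, drug_z: dict) -> float:
    """Negative dot product of z-scores over genes present in both profiles."""
    shared = disease_z.keys() & drug_z.keys()
    return -sum(disease_z[g] * drug_z[g] for g in shared)


# hypothetical z-scores for three genes
disease = {"GENE_A": 2.0, "GENE_B": -1.5, "GENE_C": 0.5}
reversing_drug = {"GENE_A": -1.0, "GENE_B": 2.0, "GENE_C": 0.0}
mimicking_drug = {"GENE_A": 1.0, "GENE_B": -2.0, "GENE_C": 0.0}

score_reversing = drug_disease_score(disease, reversing_drug)
score_mimicking = drug_disease_score(disease, mimicking_drug)
```

A drug whose expression changes oppose the disease profile receives a positive score, while one that mimics the profile receives a negative score, matching the interpretation that a large score predicts a potential treatment.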
--------------------------------------------------------------------------------
/.pre-commit-config.yaml:
--------------------------------------------------------------------------------
1 | default_language_version:
2 | python: python3.10
3 | repos:
4 | - repo: https://github.com/pre-commit/pre-commit-hooks
5 | rev: v6.0.0
6 | hooks:
7 | # Check for files that contain merge conflict strings.
8 | - id: check-merge-conflict
9 | # Check for debugger imports and py37+ `breakpoint()` calls in python source.
10 | - id: debug-statements
11 | # Replaces or checks mixed line ending
12 | - id: mixed-line-ending
13 | # Check for files that would conflict in case-insensitive filesystems
14 | - id: check-case-conflict
15 | # This hook checks toml files for parseable syntax.
16 | - id: check-toml
17 | # This hook checks yaml files for parseable syntax.
18 | - id: check-yaml
19 | - repo: https://github.com/charliermarsh/ruff-pre-commit
20 | rev: v0.14.8
21 | hooks:
22 | - id: ruff
23 | args:
24 | - --fix
25 | - repo: https://github.com/python/black
26 | rev: 25.12.0
27 | hooks:
28 | - id: black
29 | language_version: python3
30 | - repo: https://github.com/python-poetry/poetry
31 | rev: "2.2.1"
32 | hooks:
33 | - id: poetry-check
34 | - repo: https://github.com/tox-dev/pyproject-fmt
35 | rev: "v2.11.1"
36 | hooks:
37 | - id: pyproject-fmt
38 | - repo: https://github.com/rhysd/actionlint
39 | rev: v1.7.9
40 | hooks:
41 | - id: actionlint
42 | - repo: https://github.com/citation-file-format/cffconvert
43 | rev: b6045d78aac9e02b039703b030588d54d53262ac
44 | hooks:
45 | - id: validate-cff
46 | - repo: https://gitlab.com/vojko.pribudic.foss/pre-commit-update
47 | rev: v0.6.0
48 | hooks:
49 | - id: pre-commit-update
50 | args: ["--keep", "pre-commit-update", "--keep", "cffconvert"]
51 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier/50.00.supplementary_material.md:
--------------------------------------------------------------------------------
1 | ### Latent variables (gene modules) information
2 |
3 | #### LV603
4 |
5 |
6 | | Pathway | AUC | FDR |
7 | |:------------------------------------|:------|:---------|
8 | | IRIS Neutrophil-Resting | 0.91 | 4.51e-35 |
9 | | SVM Neutrophils | 0.98 | 1.43e-09 |
10 | | PID IL8CXCR2 PATHWAY | 0.81 | 7.04e-03 |
11 | | SIG PIP3 SIGNALING IN B LYMPHOCYTES | 0.77 | 1.95e-02 |
12 |
13 | Table: Pathways aligned to LV603 from the MultiPLIER models. {#tbl:sup:multiplier_pathways:lv603}
14 |
15 |
16 |
17 | | Trait description | Sample size | Cases | FDR |
18 | |:------------------------------------------|:--------------|:--------|:---------------|
19 | | Basophill percentage | 349,861 | | 1.19e‑10 |
20 | | Basophill count | 349,856 | | 1.89e‑05 |
21 | | Treatment/medication code: ispaghula husk | 361,141 | 327 | 1.36e‑02 |
22 |
23 | Table: Significant trait associations of LV603 in PhenomeXcan. {#tbl:sup:phenomexcan_assocs:lv603}
24 |
25 |
26 |
27 | | Phecode | Trait description | Sample size | Cases | FDR |
28 | |:----------------------------|:--------------------|:--------------|:--------|:------|
29 | | No significant associations | | | | |
30 |
31 | Table: Significant trait associations of LV603 in eMERGE. {#tbl:sup:emerge_assocs:lv603}
32 |
33 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/04.20.00.traits_clustering.md:
--------------------------------------------------------------------------------
1 | ### LVs reveal trait clusters with shared transcriptomic properties
2 |
3 | We used the projection of gene-trait associations into the latent space to find groups of clusters linked by the same transcriptional processes.
4 | Since individual clustering algorithms have different biases (i.e., assumptions about the data structure), we designed a consensus clustering framework that combines solutions or partitions of traits generated by different methods ([Methods](#sec:methods:clustering)).
5 | Consensus or ensemble approaches have been recommended to avoid several pitfalls when performing cluster analysis on biological data [@doi:10.1126/scisignal.aad1932].
6 | Since diversity in the ensemble is crucial for these methods, we generated different data versions which were processed using different methods with varying sets of parameters (Figure {@fig:clustering:design}a).
7 | Then, a consensus function combines the ensemble into a consolidated solution, which has been shown to outperform any individual member of the ensemble [@Strehl2002; @doi:10.1109/TPAMI.2005.113].
8 | Our clustering pipeline generated 15 final consensus clustering solutions (Figure @fig:sup:clustering:agreement).
9 | The number of clusters of these partitions (between 5 and 29) was learned from the data by selecting the partitions with the largest agreement with the ensemble [@Strehl2002].
10 | Instead of selecting one of these final solutions with a specific number of clusters, we used a clustering tree [@doi:10.1093/gigascience/giy083] (Figure @fig:clustering:tree) to examine stable groups of traits across multiple resolutions.
11 | To understand which latent variables differentiated the group of traits, we trained a decision tree classifier on the input data $\hat{\mathbf{M}}$ using the clusters found as labels (Figure {@fig:clustering:design}b, see [Methods](#sec:methods:clustering)).
12 |
--------------------------------------------------------------------------------
/tests/config_loader_fixtures/prompt_gpt3_e2e/ai-revision-prompts.yaml:
--------------------------------------------------------------------------------
1 | prompts:
2 | front_matter: Revise the following paragraph to include the keyword "testify" somewhere in the text; the keyword must be present verbatim.
3 | abstract: Revise the following paragraph to include the keyword "orchestra" somewhere in the text; the keyword must be present verbatim.
4 | introduction: Revise the following paragraph to include the keyword "wound" somewhere in the text; the keyword must be present verbatim.
5 | results: Revise the following paragraph to include the keyword "classroom" somewhere in the text; the keyword must be present verbatim.
6 | results_framework: Revise the following paragraph to include the keyword "secretary" somewhere in the text; the keyword must be present verbatim.
7 | crispr: Revise the following paragraph to include the keyword "army" somewhere in the text; the keyword must be present verbatim.
8 | drug_disease_prediction: Revise the following paragraph to include the keyword "breakdown" somewhere in the text; the keyword must be present verbatim.
9 | traits_clustering: Revise the following paragraph to include the keyword "siege" somewhere in the text; the keyword must be present verbatim.
10 | discussion: Revise the following paragraph to include the keyword "beer" somewhere in the text; the keyword must be present verbatim.
11 | methods: Revise the following paragraph to include the keyword "confront" somewhere in the text; the keyword must be present verbatim.
12 | references: Revise the following paragraph to include the keyword "disability" somewhere in the text; the keyword must be present verbatim.
13 | acknowledgements: Revise the following paragraph to include the keyword "stitch" somewhere in the text; the keyword must be present verbatim.
14 | supplementary_material: Revise the following paragraph to include the keyword "waiter" somewhere in the text; the keyword must be present verbatim.
15 |
16 | default: |
17 | This is the default prompt
18 |
--------------------------------------------------------------------------------
/tests/manuscripts/mutator-epistasis/01.abstract.md:
--------------------------------------------------------------------------------
1 | ## Abstract {.page_break_before}
2 |
3 | Maintaining germline genome integrity is essential and enormously complex.
4 | Hundreds of proteins are involved in DNA replication and proofreading, and hundreds more are mobilized to repair DNA damage [@PMID:28485537].
5 | While loss-of-function mutations in any of the genes encoding these proteins might lead to elevated mutation rates, *mutator alleles* have largely eluded detection in mammals.
6 |
7 | DNA replication and repair proteins often recognize particular sequence motifs or excise lesions at specific nucleotides.
8 | Thus, we might expect that the spectrum of *de novo* mutations — that is, the frequency of each individual mutation type (C>T, A>G, etc.) — will differ between genomes that harbor either a mutator or wild-type allele at a given locus.
9 | Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene *Mutyh* that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs [@PMID:35545679;@PMID:33472028].
10 |
11 | In this study we developed a new method, called "aggregate mutation spectrum distance," to detect alleles associated with mutation spectrum variation.
12 | By applying this approach to mutation data from the BXDs, we confirmed the presence of the germline mutator locus near *Mutyh* and discovered an additional C>A mutator locus on chromosome 6 that overlaps *Ogg1*, a DNA glycosylase involved in the same base-excision repair network as *Mutyh* [@PMID:17581577].
13 | The effect of a chromosome 6 mutator allele depended on the presence of a mutator allele near *Mutyh*, and BXDs with mutator alleles at both loci had even greater numbers of C>A mutations than those with mutator alleles at either locus alone.
14 | Our new methods for analyzing mutation spectra reveal evidence of epistasis between germline mutator alleles, and may be applicable to mutation data from humans and other model organisms.
15 |
16 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier/50.01.supplementary_material.md:
--------------------------------------------------------------------------------
1 | ### Latent variables (gene modules) information
2 |
3 | #### LV603
4 |
5 |
7 | | Pathway | AUC | FDR |
8 | |:------------------------------------|:------|:---------|
9 | | IRIS Neutrophil-Resting | 0.91 | 4.51e-35 |
10 | | SVM Neutrophils | 0.98 | 1.43e-09 |
11 | | PID IL8CXCR2 PATHWAY | 0.81 | 7.04e-03 |
12 | | SIG PIP3 SIGNALING IN B LYMPHOCYTES | 0.77 | 1.95e-02 |
13 |
14 | Table: Pathways aligned to LV603 from the MultiPLIER models. {#tbl:sup:multiplier_pathways:lv603}
15 |
16 |
17 |
21 | | Trait description | Sample size | Cases | FDR |
22 | |:------------------------------------------|:--------------|:--------|:---------------|
23 | | Basophill percentage | 349,861 | | 1.19e‑10 |
24 | | Basophill count | 349,856 | | 1.89e‑05 |
25 | | Treatment/medication code: ispaghula husk | 361,141 | 327 | 1.36e‑02 |
26 |
27 | Table: Significant trait associations of LV603 in PhenomeXcan. {#tbl:sup:phenomexcan_assocs:lv603}
28 |
32 |
33 |
39 | | Phecode | Trait description | Sample size | Cases | FDR |
40 | |:----------------------------|:--------------------|:--------------|:--------|:------|
41 | | No significant associations | | | | |
42 |
43 | Table: Significant trait associations of LV603 in eMERGE. {#tbl:sup:emerge_assocs:lv603}
44 |
45 |
--------------------------------------------------------------------------------
/CITATION.cff:
--------------------------------------------------------------------------------
1 | # This CITATION.cff file was generated with cffinit.
2 | # Visit https://bit.ly/cffinit to generate yours today!
3 | ---
4 | cff-version: 1.2.0
5 | title: Manubot AI Editor
6 | message: >-
7 | If you use this work in some way, please cite both the article from
8 | preferred-citation and the software itself. These details can be
9 | found within the CITATION.cff file.
10 | type: software
11 | authors:
12 | - given-names: Milton
13 | family-names: Pividori
14 | orcid: "https://orcid.org/0000-0002-3035-4403"
15 | - given-names: Faisal
16 | family-names: Alquaddoomi
17 | orcid: "https://orcid.org/0000-0003-4297-8747"
18 | - given-names: Vincent
19 | family-names: Rubinetti
20 | orcid: "https://orcid.org/0000-0002-4655-3773"
21 | - given-names: Dave
22 | family-names: Bunten
23 | orcid: "https://orcid.org/0000-0001-6041-3665"
24 | - given-names: Casey
25 | family-names: Greene
26 | orcid: "https://orcid.org/0000-0001-8713-9213"
27 | repository-code: "https://github.com/manubot/manubot-ai-editor"
28 | abstract: |
29 | A tool for performing automatic, AI-assisted revisions of Manubot manuscripts.
30 | keywords:
31 | - manubot
32 | - AI
33 | - editor
34 | - manuscript
35 | - revision
36 | - research
37 | - large-language-models
38 | license: BSD-3-Clause
39 | identifiers:
40 | - description: Manuscript
41 | type: doi
42 | value: "10.1093/jamia/ocae139"
43 | - description: Software
44 | type: doi
45 | value: "10.5281/zenodo.14911573"
46 | preferred-citation:
47 | title: >-
48 | A publishing infrastructure for Artificial Intelligence (AI)-assisted academic authoring
49 | type: article
50 | url: https://academic.oup.com/jamia/article/31/9/2103/7693927
51 | authors:
52 | - given-names: Milton
53 | family-names: Pividori
54 | orcid: "https://orcid.org/0000-0002-3035-4403"
55 | - given-names: Casey S.
56 | family-names: Greene
57 | orcid: "https://orcid.org/0000-0001-8713-9213"
58 | date-published: 2024-09-01
59 | identifiers:
60 | - type: doi
61 | value: 10.1093/jamia/ocae139
62 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/metadata.yaml:
--------------------------------------------------------------------------------
1 | ---
2 | title: "An efficient not-only-linear correlation coefficient based on machine learning"
3 | keywords:
4 | - correlation coefficient
5 | - nonlinear relationships
6 | - gene expression
7 | lang: en-US
8 | authors:
9 | - name: Milton Pividori
10 | github: miltondp
11 | initials: MP
12 | orcid: 0000-0002-3035-4403
13 | twitter: miltondp
14 | email: miltondp@gmail.com
15 | affiliations:
16 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
17 | funders:
18 | - The Gordon and Betty Moore Foundation (GBMF 4552)
19 | - The National Human Genome Research Institute (R01 HG010067)
20 |
21 | - name: Marylyn D. Ritchie
22 | initials: MDR
23 | orcid: 0000-0002-1208-1720
24 | twitter: MarylynRitchie
25 | email: marylyn@pennmedicine.upenn.edu
26 | affiliations:
27 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
28 |
29 | - name: Diego H. Milone
30 | github: dmilone
31 | initials: DHM
32 | orcid: 0000-0003-2182-4351
33 | twitter: d1001
34 | email: dmilone@sinc.unl.edu.ar
35 | affiliations:
36 | - Research Institute for Signals, Systems and Computational Intelligence (sinc(i)), Universidad Nacional del Litoral, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Santa Fe CP3000, Argentina
37 |
38 | - name: Casey S. Greene
39 | github: cgreene
40 | initials: CSG
41 | orcid: 0000-0001-8713-9213
42 | twitter: GreeneScientist
43 | email: casey.s.greene@cuanschutz.edu
44 | affiliations:
45 | - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA
46 | - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA
47 | funders:
48 | - The Gordon and Betty Moore Foundation (GBMF 4552)
49 | - The National Human Genome Research Institute (R01 HG010067)
50 | - The National Cancer Institute (R01 CA237170)
51 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc_non_standard_filenames/metadata.yaml:
--------------------------------------------------------------------------------
1 | ---
2 | title: "An efficient not-only-linear correlation coefficient based on machine learning"
3 | keywords:
4 | - correlation coefficient
5 | - nonlinear relationships
6 | - gene expression
7 | lang: en-US
8 | authors:
9 | - name: Milton Pividori
10 | github: miltondp
11 | initials: MP
12 | orcid: 0000-0002-3035-4403
13 | twitter: miltondp
14 | email: miltondp@gmail.com
15 | affiliations:
16 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
17 | funders:
18 | - The Gordon and Betty Moore Foundation (GBMF 4552)
19 | - The National Human Genome Research Institute (R01 HG010067)
20 |
21 | - name: Marylyn D. Ritchie
22 | initials: MDR
23 | orcid: 0000-0002-1208-1720
24 | twitter: MarylynRitchie
25 | email: marylyn@pennmedicine.upenn.edu
26 | affiliations:
27 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
28 |
29 | - name: Diego H. Milone
30 | github: dmilone
31 | initials: DHM
32 | orcid: 0000-0003-2182-4351
33 | twitter: d1001
34 | email: dmilone@sinc.unl.edu.ar
35 | affiliations:
36 | - Research Institute for Signals, Systems and Computational Intelligence (sinc(i)), Universidad Nacional del Litoral, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Santa Fe CP3000, Argentina
37 |
38 | - name: Casey S. Greene
39 | github: cgreene
40 | initials: CSG
41 | orcid: 0000-0001-8713-9213
42 | twitter: GreeneScientist
43 | email: casey.s.greene@cuanschutz.edu
44 | affiliations:
45 | - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA
46 | - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA
47 | funders:
48 | - The Gordon and Betty Moore Foundation (GBMF 4552)
49 | - The National Human Genome Research Institute (R01 HG010067)
50 | - The National Cancer Institute (R01 CA237170)
51 |
--------------------------------------------------------------------------------
/tests/config_loader_fixtures/only_revision_prompts/ai-revision-prompts.yaml:
--------------------------------------------------------------------------------
1 | prompts_files:
2 | abstract: |
3 | Revise the following paragraph from the Abstract of an academic paper (with the title '{title}' and keywords '{keywords}') so
4 | the research problem/question is clear,
5 | the solution proposed is clear,
6 | the text grammar is correct,
7 | spelling errors are fixed,
8 | and the text is in active voice and has a clear sentence structure
9 |
10 | introduction|discussion: |
11 | Revise the following paragraph from the {section_name} of an academic paper (with the title '{title}' and keywords '{keywords}') so
12 | the research problem/question is clear,
13 | the solution proposed is clear,
14 | the text grammar is correct,
15 | spelling errors are fixed,
16 | and the text is in active voice and has a clear sentence structure
17 |
18 | results: |
19 | Revise the following paragraph from the Results section of an academic paper (with the title '{title}' and keywords '{keywords}') so
20 | most references to figures and tables are kept,
21 | the details are enough to clearly explain the outcomes,
22 | sentences are concise and to the point,
23 | the text minimizes the use of jargon,
24 | the text grammar is correct,
25 | spelling errors are fixed,
26 | and the text has a clear sentence structure
27 |
28 | methods: |
29 | Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{title}' and keywords '{keywords}') so
30 | most of the citations to other academic papers are kept,
31 | most of the technical details are kept,
32 | most references to equations (such as "Equation (@id)") are kept,
33 | all equations definitions (such as '*equation_definition') are included with newlines before and after,
34 | the most important symbols in equations are defined,
35 | the text grammar is correct,
36 | spelling errors are fixed,
37 | and the text has a clear sentence structure
38 |
39 | references: null
40 |
41 | \.md$: |
42 | Proofread the following paragraph
43 |
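A minimal sketch of how a `prompts_files` mapping like the one above could be resolved for a given manuscript file. It assumes, without confirmation from this fixture, that each key is a regular expression searched against the filename, that the first matching key wins, and that a `null` value means the file is skipped. The function name `resolve_prompt` and the shortened prompt texts are hypothetical.

```python
import re

# Hypothetical, shortened stand-ins for the prompts in the fixture above.
prompts_files = {
    r"abstract": "Revise the Abstract of '{title}' (keywords: {keywords})",
    r"introduction|discussion": "Revise the {section_name} of '{title}'",
    r"references": None,  # assumption: null means "do not revise this file"
    r"\.md$": "Proofread the following paragraph",
}


def resolve_prompt(filename, title, keywords, section_name=""):
    """Return the formatted prompt for `filename`, or None if it is skipped."""
    for pattern, template in prompts_files.items():
        # Assumption: keys are regexes and the first match (in file order) wins.
        if re.search(pattern, filename):
            if template is None:
                return None
            return template.format(
                title=title,
                keywords=", ".join(keywords),
                section_name=section_name,
            )
    return None  # no key matched this filename


print(resolve_prompt("01.abstract.md", "T", ["a", "b"]))
print(resolve_prompt("10.references.md", "T", ["a", "b"]))
print(resolve_prompt("08.01.methods.ccc.md", "T", ["a", "b"]))
```

Because dictionaries preserve insertion order, placing the catch-all `\.md$` key last keeps it from shadowing the more specific keys.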
--------------------------------------------------------------------------------
/tests/config_loader_fixtures/both_prompts_config/ai-revision-prompts.yaml:
--------------------------------------------------------------------------------
1 | prompts:
2 | abstract: |
3 | Revise the following paragraph from the Abstract of an academic paper (with the title '{title}' and keywords '{keywords}') so
4 | the research problem/question is clear,
5 | the solution proposed is clear,
6 | the text grammar is correct,
7 | spelling errors are fixed,
8 | and the text is in active voice and has a clear sentence structure
9 |
10 | introduction_discussion: |
11 | Revise the following paragraph from the {section_name} of an academic paper (with the title '{title}' and keywords '{keywords}') so
12 | the research problem/question is clear,
13 | the solution proposed is clear,
14 | the text grammar is correct,
15 | spelling errors are fixed,
16 | and the text is in active voice and has a clear sentence structure
17 |
18 | results: |
19 | Revise the following paragraph from the Results section of an academic paper (with the title '{title}' and keywords '{keywords}') so
20 | most references to figures and tables are kept,
21 | the details are enough to clearly explain the outcomes,
22 | sentences are concise and to the point,
23 | the text minimizes the use of jargon,
24 | the text grammar is correct,
25 | spelling errors are fixed,
26 | and the text has a clear sentence structure
27 |
28 | methods: |
29 | Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{title}' and keywords '{keywords}') so
30 | most of the citations to other academic papers are kept,
31 | most of the technical details are kept,
32 | most references to equations (such as "Equation (@id)") are kept,
33 | all equations definitions (such as '*equation_definition') are included with newlines before and after,
34 | the most important symbols in equations are defined,
35 | the text grammar is correct,
36 | spelling errors are fixed,
37 | and the text has a clear sentence structure
38 |
39 | default: |
40 | Proofread the following paragraph (with the title '{title}' and keywords '{keywords}')
41 |
--------------------------------------------------------------------------------
/tests/config_loader_fixtures/conflicting_promptsfiles_matchings/ai-revision-prompts.yaml:
--------------------------------------------------------------------------------
1 | prompts_files:
2 | abstract: |
3 | Revise the following paragraph from the Abstract of an academic paper (with the title '{title}' and keywords '{keywords}') so
4 | the research problem/question is clear,
5 | the solution proposed is clear,
6 | the text grammar is correct,
7 | spelling errors are fixed,
8 | and the text is in active voice and has a clear sentence structure
9 |
10 | introduction|discussion: |
11 | Revise the following paragraph from the {section_name} of an academic paper (with the title '{title}' and keywords '{keywords}') so
12 | the research problem/question is clear,
13 | the solution proposed is clear,
14 | the text grammar is correct,
15 | spelling errors are fixed,
16 | and the text is in active voice and has a clear sentence structure
17 |
18 | results: |
19 | Revise the following paragraph from the Results section of an academic paper (with the title '{title}' and keywords '{keywords}') so
20 | most references to figures and tables are kept,
21 | the details are enough to clearly explain the outcomes,
22 | sentences are concise and to the point,
23 | the text minimizes the use of jargon,
24 | the text grammar is correct,
25 | spelling errors are fixed,
26 | and the text has a clear sentence structure
27 |
28 | methods: |
29 | Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{title}' and keywords '{keywords}') so
30 | most of the citations to other academic papers are kept,
31 | most of the technical details are kept,
32 | most references to equations (such as "Equation (@id)") are kept,
33 | all equations definitions (such as '*equation_definition') are included with newlines before and after,
34 | the most important symbols in equations are defined,
35 | the text grammar is correct,
36 | spelling errors are fixed,
37 | and the text has a clear sentence structure
38 |
39 | references: null
40 |
41 | default: This is the default prompt
42 |
43 | \.md$: |
44 | Proofread the following paragraph
45 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/00.front-matter.md:
--------------------------------------------------------------------------------
1 | {##
2 | This file contains a Jinja2 front-matter template that adds version and authorship information.
3 | Changing the Jinja2 templates in this file may cause incompatibility with Manubot updates.
4 | Pandoc automatically inserts title from metadata.yaml, so it is not included in this template.
5 | ##}
6 |
7 | _A DOI-citable version of this manuscript is available at _.
8 |
9 |
23 |
24 | ## Authors
25 |
26 | {## Template for listing authors ##}
27 | {% for author in manubot.authors %}
28 | + **{{author.name}}**
29 | {%- if author.orcid is defined and author.orcid is not none %}
30 | {.inline_icon width=16 height=16}
31 | [{{author.orcid}}](https://orcid.org/{{author.orcid}})
32 | {%- endif %}
33 | {%- if author.github is defined and author.github is not none %}
34 | · {.inline_icon width=16 height=16}
35 | [{{author.github}}](https://github.com/{{author.github}})
36 | {%- endif %}
37 | {%- if author.twitter is defined and author.twitter is not none %}
38 | · {.inline_icon width=16 height=16}
39 | [{{author.twitter}}](https://twitter.com/{{author.twitter}})
40 | {%- endif %}
41 |
42 | {%- if author.affiliations is defined and author.affiliations|length %}
43 | {{author.affiliations | join('; ')}}
44 | {%- endif %}
45 | {%- if author.funders is defined and author.funders|length %}
46 | · Funded by {{author.funders | join('; ')}}
47 | {%- endif %}
48 |
49 | {% endfor %}
50 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc_non_standard_filenames/00.front-matter.md:
--------------------------------------------------------------------------------
1 | {##
2 | This file contains a Jinja2 front-matter template that adds version and authorship information.
3 | Changing the Jinja2 templates in this file may cause incompatibility with Manubot updates.
4 | Pandoc automatically inserts title from metadata.yaml, so it is not included in this template.
5 | ##}
6 |
7 | _A DOI-citable version of this manuscript is available at _.
8 |
9 |
23 |
24 | ## Authors
25 |
26 | {## Template for listing authors ##}
27 | {% for author in manubot.authors %}
28 | + **{{author.name}}**
29 | {%- if author.orcid is defined and author.orcid is not none %}
30 | {.inline_icon width=16 height=16}
31 | [{{author.orcid}}](https://orcid.org/{{author.orcid}})
32 | {%- endif %}
33 | {%- if author.github is defined and author.github is not none %}
34 | · {.inline_icon width=16 height=16}
35 | [{{author.github}}](https://github.com/{{author.github}})
36 | {%- endif %}
37 | {%- if author.twitter is defined and author.twitter is not none %}
38 | · {.inline_icon width=16 height=16}
39 | [{{author.twitter}}](https://twitter.com/{{author.twitter}})
40 | {%- endif %}
41 |
42 | {%- if author.affiliations is defined and author.affiliations|length %}
43 | {{author.affiliations | join('; ')}}
44 | {%- endif %}
45 | {%- if author.funders is defined and author.funders|length %}
46 | · Funded by {{author.funders | join('; ')}}
47 | {%- endif %}
48 |
49 | {% endfor %}
50 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 |
54 | # Translations
55 | *.mo
56 | *.pot
57 |
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 |
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 |
68 | # Scrapy stuff:
69 | .scrapy
70 |
71 | # Sphinx documentation
72 | docs/_build/
73 |
74 | # PyBuilder
75 | target/
76 |
77 | # Jupyter Notebook
78 | .ipynb_checkpoints
79 |
80 | # IPython
81 | profile_default/
82 | ipython_config.py
83 |
84 | # pyenv
85 | .python-version
86 |
87 | # pipenv
88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
91 | # install all needed dependencies.
92 | #Pipfile.lock
93 |
94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
95 | __pypackages__/
96 |
97 | # Celery stuff
98 | celerybeat-schedule
99 | celerybeat.pid
100 |
101 | # SageMath parsed files
102 | *.sage.py
103 |
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 |
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 |
117 | # Rope project settings
118 | .ropeproject
119 |
120 | # mkdocs documentation
121 | /site
122 |
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 |
128 | # Pyre type checker
129 | .pyre/
130 |
131 | .idea/
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/08.15.methods.giant.md:
--------------------------------------------------------------------------------
1 | ### Tissue-specific network analyses using GIANT {#sec:giant}
2 |
3 | We accessed tissue-specific gene networks of GIANT using both the web interface and web services provided by HumanBase [@url:https://hb.flatironinstitute.org].
4 | The GIANT version used in this study included 987 genome-scale datasets with approximately 38,000 conditions from around 14,000 publications.
5 | Details on how these networks were built are described in [@doi:10.1038/ng.3259].
6 | Briefly, tissue-specific gene networks were built using gene expression data (without GTEx samples [@url:https://hb.flatironinstitute.org/data]) from the NCBI's Gene Expression Omnibus (GEO) [@doi:10.1093/nar/gks1193], protein-protein interaction (BioGRID [@pmc:PMC3531226], IntAct [@doi:10.1093/nar/gkr1088], MINT [@doi:10.1093/nar/gkr930] and MIPS [@pmc:PMC148093]), transcription factor regulation using binding motifs from JASPAR [@doi:10.1093/nar/gkp950], and chemical and genetic perturbations from MSigDB [@doi:10.1073/pnas.0506580102].
7 | Gene expression data were log-transformed, and the Pearson correlation was computed for each gene pair, normalized using the Fisher's z transform, and z-scores discretized into different bins.
8 | Gold standards for tissue-specific functional relationships were built using expert curation and experimentally derived gene annotations from the Gene Ontology.
9 | Then, one naive Bayesian classifier (using C++ implementations from the Sleipnir library [@pmid:18499696]) for each of the 144 tissues was trained using these gold standards.
10 | Finally, these classifiers were used to estimate the probability of tissue-specific interactions for each gene pair.
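
The preprocessing steps described above (log transform, Pearson correlation, Fisher's z transform, binning) can be illustrated with a minimal sketch. This is not the GIANT or Sleipnir implementation: the pseudocount of 1, the clipping constant `eps`, the bin edges, and all function names are assumptions made for illustration, and any further standardization of the transformed values is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fisher_z(r, eps=1e-6):
    """Fisher's z transform, atanh(r), with r clipped away from +/-1."""
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return math.atanh(r)


def bin_index(z, edges):
    """Index of the bin containing z, given sorted bin edges."""
    return sum(z >= e for e in edges)


def binned_coexpression(gene_a, gene_b, edges=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    # Log-transform expression values (the pseudocount of 1 is an assumption).
    log_a = [math.log2(v + 1.0) for v in gene_a]
    log_b = [math.log2(v + 1.0) for v in gene_b]
    z = fisher_z(pearson(log_a, log_b))
    return z, bin_index(z, edges)


# Toy expression profiles for one gene pair across five samples.
gene_a = [1.0, 3.0, 7.0, 15.0, 31.0]
gene_b = [2.0, 5.0, 9.0, 20.0, 40.0]
z, b = binned_coexpression(gene_a, gene_b)
print(z, b)
```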
11 |
12 |
13 | For each pair of genes prioritized in our study using GTEx, we used GIANT through HumanBase to obtain
14 | 1) a predicted gene network for blood (manually selected to match whole blood in GTEx) and
15 | 2) a gene network with an automatically predicted tissue using the method described in [@doi:10.1101/gr.155697.113] and provided by HumanBase web interfaces/services.
16 | Briefly, the tissue prediction approach trains a machine learning model using comprehensive transcriptional data with human-curated markers of different cell lineages (e.g., macrophages) as gold standards.
17 | Then, these models are used to predict other cell lineage-specific genes.
18 | In addition to reporting this predicted tissue or cell lineage, we computed the average probability of interaction between all genes in the network retrieved from GIANT.
19 | Following the default procedure used in GIANT, we included the top 15 genes with the highest probability of interaction with the queried gene pair for each network.
20 |
--------------------------------------------------------------------------------
/tests/provider_fixtures/provider_model_engines.json:
--------------------------------------------------------------------------------
1 | {
2 | "OpenAIProvider": [
3 | "gpt-4o-audio-preview-2024-12-17",
4 | "dall-e-3",
5 | "text-embedding-3-large",
6 | "dall-e-2",
7 | "o4-mini-2025-04-16",
8 | "gpt-4o-audio-preview-2024-10-01",
9 | "o4-mini",
10 | "gpt-4.1-nano",
11 | "o1-2024-12-17",
12 | "gpt-4.1-nano-2025-04-14",
13 | "gpt-4o-realtime-preview-2024-10-01",
14 | "o1",
15 | "gpt-4o-realtime-preview",
16 | "babbage-002",
17 | "gpt-4-turbo-preview",
18 | "o1-pro",
19 | "o1-pro-2025-03-19",
20 | "tts-1-hd-1106",
21 | "gpt-4-0125-preview",
22 | "gpt-4",
23 | "text-embedding-ada-002",
24 | "o3-2025-04-16",
25 | "tts-1-hd",
26 | "gpt-4o-mini-audio-preview",
27 | "gpt-4o-audio-preview",
28 | "o1-preview-2024-09-12",
29 | "o3",
30 | "gpt-4o-mini-realtime-preview",
31 | "gpt-4.1-mini",
32 | "gpt-4o-mini-realtime-preview-2024-12-17",
33 | "gpt-3.5-turbo-instruct-0914",
34 | "gpt-4o-mini-search-preview",
35 | "gpt-4.1-mini-2025-04-14",
36 | "tts-1-1106",
37 | "chatgpt-4o-latest",
38 | "davinci-002",
39 | "gpt-3.5-turbo-1106",
40 | "gpt-4o-search-preview",
41 | "gpt-4-turbo",
42 | "gpt-4o-realtime-preview-2024-12-17",
43 | "gpt-3.5-turbo-instruct",
44 | "gpt-3.5-turbo",
45 | "gpt-4-1106-preview",
46 | "gpt-4o-mini-search-preview-2025-03-11",
47 | "gpt-4o-2024-11-20",
48 | "whisper-1",
49 | "gpt-4o-2024-05-13",
50 | "gpt-4-turbo-2024-04-09",
51 | "gpt-3.5-turbo-16k",
52 | "o1-preview",
53 | "gpt-4-0613",
54 | "computer-use-preview-2025-03-11",
55 | "computer-use-preview",
56 | "gpt-4.5-preview",
57 | "gpt-4.5-preview-2025-02-27",
58 | "gpt-4o-search-preview-2025-03-11",
59 | "tts-1",
60 | "omni-moderation-2024-09-26",
61 | "text-embedding-3-small",
62 | "gpt-4o-mini-tts",
63 | "gpt-4o",
64 | "o3-mini",
65 | "o3-mini-2025-01-31",
66 | "gpt-4o-mini",
67 | "gpt-4o-2024-08-06",
68 | "gpt-4.1",
69 | "gpt-4o-transcribe",
70 | "gpt-4.1-2025-04-14",
71 | "gpt-4o-mini-2024-07-18",
72 | "gpt-4o-mini-transcribe",
73 | "o1-mini",
74 | "gpt-4o-mini-audio-preview-2024-12-17",
75 | "gpt-3.5-turbo-0125",
76 | "o1-mini-2024-09-12",
77 | "omni-moderation-latest"
78 | ],
79 | "AnthropicProvider": [
80 | "claude-3-7-sonnet-20250219",
81 | "claude-3-5-sonnet-20241022",
82 | "claude-3-5-haiku-20241022",
83 | "claude-3-5-sonnet-20240620",
84 | "claude-3-haiku-20240307",
85 | "claude-3-opus-20240229",
86 | "claude-3-sonnet-20240229",
87 | "claude-2.1",
88 | "claude-2.0"
89 | ],
90 | "__generated_on__": "2025-04-17T11:20:13.938509"
91 | }
--------------------------------------------------------------------------------
/tests/provider_fixtures/refresh_model_engines.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | """
4 | This command persists model engines for each provider to MODEL_PROVIDER_JSON.
5 | This list is used in non-live tests so that we don't need users who run the test
6 | suite to provide valid API keys for every provider.
7 | """
8 |
9 | from datetime import datetime
10 | import json
11 | import os
12 | from pathlib import Path
13 | from manubot_ai_editor.model_providers import MODEL_PROVIDERS
14 |
15 | MODEL_PROVIDER_JSON = "provider_model_engines.json"
16 |
17 |
18 | def persist_provider_model_engines():
19 |     """
20 |     Persists the default model engines for each provider to a JSON file,
21 |     distributed with the package. This method requires valid API keys to be
22 |     available for each provider, since the list of models per provider is pulled
23 |     from the API.
24 |
25 |     The JSON file is used as a fallback in case the API is for some reason
26 |     unavailable, e.g. in testing when we don't have valid keys.
27 |
28 |     It's unfortunate that the model providers require authenticated access to
29 |     get the list of models, but that's how it is. Ideally we'd run this with
30 |     each release, to at least capture which model engines are available at the
31 |     time of release.
32 |     """
33 |
34 |     with Path(MODEL_PROVIDER_JSON).open("w") as f:
35 |         model_list = {
36 |             provider.__class__.__name__: provider.get_models()
37 |             for provider in MODEL_PROVIDERS.values()
38 |             if not provider.is_local_provider()
39 |         }
40 |         model_list["__generated_on__"] = datetime.now().isoformat()
41 |
42 |         json.dump(model_list, f, indent=2)
43 |
44 |     return model_list
45 |
46 |
47 | def retrieve_provider_model_engines():
48 |     """
49 |     Pulls the default model engines for each provider from the JSON file
50 |     distributed with the package.
51 |     """
52 |
53 |     with Path(MODEL_PROVIDER_JSON).open("r") as f:
54 |         return json.load(f)
55 |
56 |
57 | def main():
58 |     # check if we have valid API keys for each remote (non-local) provider
59 |     for provider in [x for x in MODEL_PROVIDERS.values() if not x.is_local_provider()]:
60 |         provider_key_var = provider.api_key_env_var()
61 |
62 |         if not os.environ.get(provider_key_var):
63 |             raise ValueError(
64 |                 f"Provider {provider.__class__.__name__} requires an API key in"
65 |                 f" env var {provider_key_var}, but none is set."
66 |             )
67 |
68 |     # persist the provider model engines to a JSON file
69 |     new_list = persist_provider_model_engines()
70 |
71 |     print(
72 |         f"Persisted {sum(len(x) for x in new_list.values() if isinstance(x, list))} model engines"
73 |         f" to {MODEL_PROVIDER_JSON}"
74 |     )
75 |
76 |
77 | if __name__ == "__main__":
78 |     main()
79 |
--------------------------------------------------------------------------------
/tests/conftest.py:
--------------------------------------------------------------------------------
1 | """
2 | Configures 'cost' marker for tests that cost money (i.e. OpenAI API credits)
3 | to run.
4 |
5 | Adapted from https://docs.pytest.org/en/latest/example/simple.html#control-skipping-of-tests-according-to-command-line-option
6 | """
7 |
8 | import json
9 | import pytest
10 |
11 | from pathlib import Path
12 | from unittest import mock
13 |
14 |
15 | def pytest_addoption(parser):
16 |     parser.addoption(
17 |         "--runcost",
18 |         action="store_true",
19 |         default=False,
20 |         help="run tests that can incur API usage costs",
21 |     )
22 |
23 |
24 | def pytest_configure(config):
25 |     config.addinivalue_line(
26 |         "markers", "cost: mark test as possibly costing money to run"
27 |     )
28 |     config.addinivalue_line(
29 |         "markers",
30 |         "mocked_model_list: mark test as having used the provider's cached model list",
31 |     )
32 |
33 |
34 | def pytest_collection_modifyitems(config, items):
35 |     if config.getoption("--runcost"):
36 |         # --runcost given in cli: do not skip cost tests
37 |         return
38 |
39 |     skip_cost = pytest.mark.skip(reason="need --runcost option to run")
40 |
41 |     for item in items:
42 |         if "cost" in item.keywords:
43 |             item.add_marker(skip_cost)
44 |
45 |
46 | # since we don't have valid provider API keys during tests, we can't get the
47 | # model list from the provider API. instead, we mock the method that retrieves
48 | # the model list to return a cached version of the model list stored in the
49 | # provider_model_engines.json file
50 | @pytest.fixture(autouse=True, scope="function")
51 | def patch_model_list_cache(request):
52 |     # skip patching if the test or anything above it is marked with 'cost',
53 |     # which implies we have a valid API key and thus should retrieve the model
54 |     # list from the provider API
55 |     if request.node.get_closest_marker("cost") is not None:
56 |         yield
57 |         return
58 |
59 |     # path to the provider_model_engines.json file
60 |     provider_model_engine_json = (
61 |         Path(__file__).parent / "provider_fixtures" / "provider_model_engines.json"
62 |     )
63 |
64 |     # load the provider model list once, then use it in our mocked method
65 |     with provider_model_engine_json.open("r") as f:
66 |         provider_model_engines = json.load(f)
67 |
68 |     @classmethod
69 |     def cached_model_list_retriever(cls):
70 |         # annotate the request object to indicate we're using a mocked method
71 |         # we want the live API tests to ensure they're not using this mock
72 |         request.node.add_marker("mocked_model_list")
73 |
74 |         return provider_model_engines[cls.__name__]
75 |
76 |     # finally, apply the mock
77 |     with mock.patch(
78 |         "manubot_ai_editor.model_providers.BaseModelProvider.get_models",
79 |         new=cached_model_list_retriever,
80 |     ) as mock_method:
81 |         yield mock_method
82 |
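The fixture above replaces a base-class method with a `classmethod` so that each provider subclass looks up its own cached list through `cls.__name__`. The following standalone sketch shows that mechanism using only the standard library. The class names, model names, and the `CACHED` dictionary are hypothetical stand-ins and are not part of `manubot_ai_editor`.

```python
from unittest import mock


class BaseProvider:
    @classmethod
    def get_models(cls):
        # Stand-in for a call that would need a valid API key.
        raise RuntimeError("would call a remote API")


class OpenAILikeProvider(BaseProvider):
    pass


class AnthropicLikeProvider(BaseProvider):
    pass


# Hypothetical cached lists, keyed by class name like the JSON fixture.
CACHED = {
    "OpenAILikeProvider": ["model-a"],
    "AnthropicLikeProvider": ["model-b"],
}


@classmethod
def cached_get_models(cls):
    # cls is the subclass the method was called on, not the base class.
    return CACHED[cls.__name__]


def models_with_cache():
    # Patch the base class once; every subclass inherits the replacement.
    with mock.patch.object(BaseProvider, "get_models", new=cached_get_models):
        return OpenAILikeProvider.get_models(), AnthropicLikeProvider.get_models()


print(models_with_cache())
```

Once the `with` block exits, the original method is restored, so code outside the patched scope still reaches the real implementation.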
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full/content/15.acknowledgements.md:
--------------------------------------------------------------------------------
1 | ## Acknowledgements
2 |
3 | This study was funded by:
4 | the Gordon and Betty Moore Foundation (GBMF 4552 to C.S. Greene; GBMF 4560 to B.D. Sullivan),
5 | the National Human Genome Research Institute (R01 HG010067 to C.S. Greene, S.F.A. Grant and B.D. Sullivan; K99 HG011898 and R00 HG011898 to M. Pividori; U01 HG011181 to W. Wei),
6 | the National Cancer Institute (R01 CA237170 to C.S. Greene),
7 | the Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765 to C.S. Greene),
8 | the National Institute on Aging (R01AG069900 to W. Wei),
9 | the National Institute of General Medical Sciences (R01 GM139891 to W. Wei);
10 | the National Heart, Lung, and Blood Institute (R01 HL163854 to Q. Feng);
11 | the National Institute of Diabetes and Digestive and Kidney Diseases (DK126194 to B.F. Voight);
12 | the Daniel B. Burke Endowed Chair for Diabetes Research to S.F.A. Grant;
13 | the Robert L. McNeil Jr. Endowed Fellowship in Translational Medicine and Therapeutics to C. Skarke.
14 |
15 | The Phase III of the eMERGE Network was initiated and funded by the NHGRI through the following grants:
16 | U01 HG8657 (Group Health Cooperative/University of Washington);
17 | U01 HG8685 (Brigham and Women's Hospital);
18 | U01 HG8672 (Vanderbilt University Medical Center);
19 | U01 HG8666 (Cincinnati Children's Hospital Medical Center);
20 | U01 HG6379 (Mayo Clinic);
21 | U01 HG8679 (Geisinger Clinic);
22 | U01 HG8680 (Columbia University Health Sciences);
23 | U01 HG8684 (Children's Hospital of Philadelphia);
24 | U01 HG8673 (Northwestern University);
25 | U01 HG8701 (Vanderbilt University Medical Center serving as the Coordinating Center);
26 | U01 HG8676 (Partners Healthcare/Broad Institute);
27 | and U01 HG8664 (Baylor College of Medicine).
28 |
29 | The Penn Medicine BioBank (PMBB) is funded by the Perelman School of Medicine at the University of Pennsylvania, a gift from the Smilow family, and the National Center for Advancing Translational Sciences of the National Institutes of Health under CTSA Award Number UL1TR001878.
30 | We thank D. Birtwell, H. Williams, P. Baumann and M. Risman for informatics support regarding the PMBB.
31 | We thank the staff of the Regeneron Genetics Center for whole-exome sequencing of DNA from PMBB participants.
32 |
33 | Figure {@fig:entire_process}a was created with BioRender.com.
34 |
35 |
36 | ## Author contributions statement
37 |
38 | M. Pividori and C.S. Greene conceived and designed the study.
39 | M. Pividori designed the computational methods, performed the experiments, analyzed the data, interpreted the results, and drafted the manuscript.
40 | C.S. Greene supervised the entire project and provided critical guidance throughout the study.
41 | S. Lu, C. Su, and M.E. Johnson performed the CRISPR screen with the supervision of S.F.A. Grant.
42 | B. Li provided the TWAS results for eMERGE for replication, and this analysis was supervised by M.D. Ritchie.
43 | W. Wei, Q. Feng, B. Namjou, K. Kiryluk, I. Kullo, Y. Luo, and M.D. Ritchie, as part of the eMERGE consortium, provided critical feedback regarding the analyses of this data.
44 | All authors revised the manuscript and provided critical feedback.
45 |
46 | ## Competing interests statement
47 |
48 | The authors declare no competing interests.
49 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full/content/00.front-matter.md:
--------------------------------------------------------------------------------
1 | {##
2 | This file contains a Jinja2 front-matter template that adds version and authorship information.
3 | Changing the Jinja2 templates in this file may cause incompatibility with Manubot updates.
4 | Pandoc automatically inserts title from metadata.yaml, so it is not included in this template.
5 | ##}
6 |
7 | _A DOI-citable version of this manuscript is available at
_
8 |
9 |
23 |
24 | {% if manubot.date_long != manubot.generated_date_long -%}
25 | Published: {{manubot.date_long}}
26 | {% endif %}
27 |
28 | ## Authors
29 |
30 | {## Template for listing authors ##}
31 | {% for author in manubot.authors %}
32 | + **{{author.name}}**
33 | {% if author.corresponding is defined and author.corresponding == true -%}^[✉](#correspondence)^{%- endif -%}
34 |
35 | {%- set has_ids = false %}
36 | {%- if author.orcid is defined and author.orcid is not none %}
37 | {%- set has_ids = true %}
38 | {.inline_icon width=16 height=16}
39 | [{{author.orcid}}](https://orcid.org/{{author.orcid}})
40 | {%- endif %}
41 | {%- if author.github is defined and author.github is not none %}
42 | {%- set has_ids = true %}
43 | · {.inline_icon width=16 height=16}
44 | [{{author.github}}](https://github.com/{{author.github}})
45 | {%- endif %}
46 | {%- if author.twitter is defined and author.twitter is not none %}
47 | {%- set has_ids = true %}
48 | · {.inline_icon width=16 height=16}
49 | [{{author.twitter}}](https://twitter.com/{{author.twitter}})
50 | {%- endif %}
51 | {%- if author.mastodon is defined and author.mastodon is not none and author["mastodon-server"] is defined and author["mastodon-server"] is not none %}
52 | {%- set has_ids = true %}
53 | · {.inline_icon width=16 height=16}
54 | [\@{{author.mastodon}}@{{author["mastodon-server"]}}](https://{{author["mastodon-server"]}}/@{{author.mastodon}})
55 | {%- endif %}
56 | {%- if has_ids %}
57 |
58 | {%- endif %}
59 |
60 | {%- if author.affiliations is defined and author.affiliations|length %}
61 | {{author.affiliations | join('; ')}}
62 | {%- endif %}
63 | {%- if author.funders is defined and author.funders|length %}
64 | · Funded by {{author.funders | join('; ')}}
65 | {%- endif %}
66 |
67 | {% endfor %}
68 |
69 | ::: {#correspondence}
70 | ✉ — Correspondence possible via {% if manubot.ci_source is defined -%}[GitHub Issues](https://github.com/{{manubot.ci_source.repo_slug}}/issues){% else %}GitHub Issues{% endif %}
71 | {% if manubot.authors|map(attribute='corresponding')|select|max -%}
72 | or email to
73 | {% for author in manubot.authors|selectattr("corresponding") -%}
74 | {{ author.name }} \<{{ author.email }}\>{{ ", " if not loop.last else "." }}
75 | {% endfor %}
76 | {% endif %}
77 | :::
78 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/00.front-matter.md:
--------------------------------------------------------------------------------
1 | {##
2 | This file contains a Jinja2 front-matter template that adds version and authorship information.
3 | Changing the Jinja2 templates in this file may cause incompatibility with Manubot updates.
4 | Pandoc automatically inserts title from metadata.yaml, so it is not included in this template.
5 | ##}
6 |
7 | _A DOI-citable version of this manuscript is available at
_
8 |
9 |
23 |
24 | {% if manubot.date_long != manubot.generated_date_long -%}
25 | Published: {{manubot.date_long}}
26 | {% endif %}
27 |
28 | ## Authors
29 |
30 | {## Template for listing authors ##}
31 | {% for author in manubot.authors %}
32 | + **{{author.name}}**
33 | {% if author.corresponding is defined and author.corresponding == true -%}^[✉](#correspondence)^{%- endif -%}
34 |
35 | {%- set has_ids = false %}
36 | {%- if author.orcid is defined and author.orcid is not none %}
37 | {%- set has_ids = true %}
38 | {.inline_icon width=16 height=16}
39 | [{{author.orcid}}](https://orcid.org/{{author.orcid}})
40 | {%- endif %}
41 | {%- if author.github is defined and author.github is not none %}
42 | {%- set has_ids = true %}
43 | · {.inline_icon width=16 height=16}
44 | [{{author.github}}](https://github.com/{{author.github}})
45 | {%- endif %}
46 | {%- if author.twitter is defined and author.twitter is not none %}
47 | {%- set has_ids = true %}
48 | · {.inline_icon width=16 height=16}
49 | [{{author.twitter}}](https://twitter.com/{{author.twitter}})
50 | {%- endif %}
51 | {%- if author.mastodon is defined and author.mastodon is not none and author["mastodon-server"] is defined and author["mastodon-server"] is not none %}
52 | {%- set has_ids = true %}
53 | · {.inline_icon width=16 height=16}
54 | [\@{{author.mastodon}}@{{author["mastodon-server"]}}](https://{{author["mastodon-server"]}}/@{{author.mastodon}})
55 | {%- endif %}
56 | {%- if has_ids %}
57 |
58 | {%- endif %}
59 |
60 | {%- if author.affiliations is defined and author.affiliations|length %}
61 | {{author.affiliations | join('; ')}}
62 | {%- endif %}
63 | {%- if author.funders is defined and author.funders|length %}
64 | · Funded by {{author.funders | join('; ')}}
65 | {%- endif %}
66 |
67 | {% endfor %}
68 |
69 | ::: {#correspondence}
70 | ✉ — Correspondence possible via {% if manubot.ci_source is defined -%}[GitHub Issues](https://github.com/{{manubot.ci_source.repo_slug}}/issues){% else %}GitHub Issues{% endif %}
71 | {% if manubot.authors|map(attribute='corresponding')|select|max -%}
72 | or email to
73 | {% for author in manubot.authors|selectattr("corresponding") -%}
74 | {{ author.name }} \<{{ author.email }}\>{{ ", " if not loop.last else "." }}
75 | {% endfor %}
76 | {% endif %}
77 | :::
78 |
--------------------------------------------------------------------------------
/tests/manuscripts/mutator-epistasis/00.front-matter.md:
--------------------------------------------------------------------------------
1 | {##
2 | This file contains a Jinja2 front-matter template that adds version and authorship information.
3 | Changing the Jinja2 templates in this file may cause incompatibility with Manubot updates.
4 | Pandoc automatically inserts title from metadata.yaml, so it is not included in this template.
5 | ##}
6 |
7 | {## Uncomment & edit the following line to reference to a preprinted or published version of the manuscript.
8 | _A DOI-citable version of this manuscript is available at _.
9 | ##}
10 |
11 | {## Template to insert build date and source ##}
12 |
13 | This manuscript
14 | {% if manubot.ci_source is defined and manubot.ci_source.provider == "appveyor" -%}
15 | ([permalink]({{manubot.ci_source.artifact_url}}))
16 | {% elif manubot.html_url_versioned is defined -%}
17 | ([permalink]({{manubot.html_url_versioned}}))
18 | {% endif -%}
19 | was automatically generated
20 | {% if manubot.ci_source is defined -%}
21 | from [{{manubot.ci_source.repo_slug}}@{{manubot.ci_source.commit | truncate(length=7, end='', leeway=0)}}](https://github.com/{{manubot.ci_source.repo_slug}}/tree/{{manubot.ci_source.commit}})
22 | {% endif -%}
23 | on {{manubot.generated_date_long}}.
24 |
25 |
26 | {% if manubot.date_long != manubot.generated_date_long -%}
27 | Published: {{manubot.date_long}}
28 | {% endif %}
29 |
30 | ## Authors
31 |
32 | {## Template for listing authors ##}
33 | {% for author in manubot.authors %}
34 | + **{{author.name}}**
35 | {% if author.corresponding is defined and author.corresponding == true -%}^[✉](#correspondence)^{%- endif -%}
36 |
37 | {%- set has_ids = false %}
38 | {%- if author.orcid is defined and author.orcid is not none %}
39 | {%- set has_ids = true %}
40 | {.inline_icon width=16 height=16}
41 | [{{author.orcid}}](https://orcid.org/{{author.orcid}})
42 | {%- endif %}
43 | {%- if author.github is defined and author.github is not none %}
44 | {%- set has_ids = true %}
45 | · {.inline_icon width=16 height=16}
46 | [{{author.github}}](https://github.com/{{author.github}})
47 | {%- endif %}
48 | {%- if author.twitter is defined and author.twitter is not none %}
49 | {%- set has_ids = true %}
50 | · {.inline_icon width=16 height=16}
51 | [{{author.twitter}}](https://twitter.com/{{author.twitter}})
52 | {%- endif %}
53 | {%- if author.mastodon is defined and author.mastodon is not none and author["mastodon-server"] is defined and author["mastodon-server"] is not none %}
54 | {%- set has_ids = true %}
55 | · {.inline_icon width=16 height=16}
56 | [\@{{author.mastodon}}@{{author["mastodon-server"]}}](https://{{author["mastodon-server"]}}/@{{author.mastodon}})
57 | {%- endif %}
58 | {%- if has_ids %}
59 |
60 | {%- endif %}
61 |
62 | {%- if author.affiliations is defined and author.affiliations|length %}
63 | {{author.affiliations | join('; ')}}
64 | {%- endif %}
65 | {%- if author.funders is defined and author.funders|length %}
66 | · Funded by {{author.funders | join('; ')}}
67 | {%- endif %}
68 |
69 | {% endfor %}
70 |
71 | ::: {#correspondence}
72 | ✉ — Correspondence possible via {% if manubot.ci_source is defined -%}[GitHub Issues](https://github.com/{{manubot.ci_source.repo_slug}}/issues){% else %}GitHub Issues{% endif %}
73 | {% if manubot.authors|map(attribute='corresponding')|select|max -%}
74 | or email to
75 | {% for author in manubot.authors|selectattr("corresponding") -%}
76 | {{ author.name }} \<{{ author.email }}\>{{ ", " if not loop.last else "." }}
77 | {% endfor %}
78 | {% endif %}
79 | :::
80 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full/content/04.05.01.crispr.md:
--------------------------------------------------------------------------------
1 | ### LVs link genes that alter lipid accumulation with relevant traits and tissues
2 |
3 | Our first experiment attempted to answer whether genes in a disease-relevant LV could represent potential therapeutic targets.
4 | For this, the first step was to obtain a set of genes strongly associated with a phenotype of interest.
5 | Therefore, we performed a fluorescence-based CRISPR-Cas9 screen in the HepG2 cell line and identified 462 genes associated with lipid regulation ([Methods](#sec:methods:crispr)).
6 | From these, we selected two high-confidence gene sets that either caused a decrease or increase of lipids:
7 | a lipids-decreasing gene-set with eight genes: *BLCAP*, *FBXW7*, *INSIG2*, *PCYT2*, *PTEN*, *SOX9*, *TCF7L2*, *UBE2J2*;
8 | and a lipids-increasing gene-set with six genes: *ACACA*, *DGAT2*, *HILPDA*, *MBTPS1*, *SCAP*, *SRPR* (Supplementary Data 2).
9 |
10 |
11 | ![
12 | **Tissues and traits associated with a gene module related to lipid metabolism (LV246).**
13 |
14 | **a)** Top cell types/tissues in which LV246's genes are expressed.
15 | Values in the $y$-axis come from matrix $\mathbf{B}$ in the MultiPLIER models (Figure {@fig:entire_process}b, see Methods).
16 | In the $x$-axis, cell types/tissues are sorted by the maximum sample value.
17 |
18 | **b)** Gene-trait associations (unadjusted $p$-values from S-MultiXcan [@doi:10.1371/journal.pgen.1007889]; threshold at -log($p$)=10) and colocalization probability (fastENLOC) for the top traits in LV246.
19 | The top 40 genes in LV246 are shown, sorted by their LV weight (matrix $\mathbf{Z}$), from largest (the top gene *SCD*) to smallest (*FAR2*);
20 | *DGAT2* and *ACACA*, in boldface, are two of the six high-confidence genes in the lipids-increasing gene set from the CRISPR screen.
21 | Cardiovascular-related traits are in boldface.
22 |
23 | SGBS: Simpson Golabi Behmel Syndrome;
24 | CH2DB: CH2 groups to double bonds ratio;
25 | HDL: high-density lipoprotein;
26 | RCP: locus regional colocalization probability.
27 |
28 | ](images/lvs_analysis/lv246/lv246.svg "LV246 TWAS plot"){#fig:lv246 width="100%"}
29 |
30 |
31 | Next, we analyzed all 987 LVs using Fast Gene Set Enrichment Analysis (FGSEA) [@doi:10.1101/060012], and found 15 LVs nominally enriched (unadjusted *P* < 0.01) with these lipid-altering gene-sets (Tables @tbl:sup:lipids_crispr:modules_enriched_increase and @tbl:sup:lipids_crispr:modules_enriched_decrease).
32 | Among those with reliable sample metadata, LV246, the top LV associated with the lipids-increasing gene-set, contained genes mainly co-expressed in adipose tissue (Figure {@fig:lv246}a), which plays a key role in coordinating and regulating lipid metabolism.
33 | Using our regression framework across all traits in PhenomeXcan, we found that gene weights for this LV were predictive of gene associations for plasma lipids, high cholesterol, and Alzheimer's disease (Table @tbl:sup:phenomexcan_assocs:lv246, FDR < 1e-23).
34 | These lipids-related associations also replicated across the 309 traits in eMERGE (Table @tbl:sup:emerge_assocs:lv246), where LV246 was significantly associated with hypercholesterolemia (phecode: 272.11, FDR < 4e-9), hyperlipidemia (phecode: 272.1, FDR < 4e-7) and disorders of lipoid metabolism (phecode: 272, FDR < 4e-7).
35 |
36 |
37 | Two high-confidence genes from our CRISPR screening, *DGAT2* and *ACACA*, are responsible for encoding enzymes for triglycerides and fatty acid synthesis and were among the highest-weighted genes of LV246 (Figure {@fig:lv246}b, in boldface).
38 | However, in contrast to other members of LV246, *DGAT2* and *ACACA* were neither associated nor colocalized with any of the cardiovascular-related traits and thus would not have been prioritized by TWAS alone;
39 | instead, other members of LV246, such as *SCD*, *LPL*, *FADS2*, *HMGCR*, and *LDLR*, were significantly associated and colocalized with lipid-related traits.
40 | This lack of association of two high-confidence genes from our CRISPR screen might be explained from an omnigenic point of view [@doi:10.1016/j.cell.2019.04.014].
41 | Assuming that the TWAS models for *DGAT2* and *ACACA* capture all common *cis*-eQTLs (the only genetic component of gene expression that TWAS can capture) and there are no rare *cis*-eQTLs, these two genes might represent "core" genes (i.e., they directly affect the trait with no mediated regulation of other genes), and many of the rest in the LV are "peripheral" genes that *trans*-regulate them.
42 |
43 |
--------------------------------------------------------------------------------
/tests/test_model_providers.py:
--------------------------------------------------------------------------------
1 | import os
2 | from unittest import mock
3 | from manubot_ai_editor import env_vars
4 | import pytest
5 |
6 | from manubot_ai_editor.model_providers import BaseModelProvider, MODEL_PROVIDERS
7 |
8 |
9 | @pytest.mark.parametrize(
10 | "provider",
11 | MODEL_PROVIDERS.values(),
12 | )
13 | def test_model_provider_fields(provider: BaseModelProvider):
14 | """
15 | Tests that each model provider has:
16 | - a default model engine
17 | - a non-empty API key environment variable (if applicable)
18 | - clients for the 'chat' and 'completions' endpoints
19 | - an endpoint for the default model engine
20 | - at least one model available
21 | """
22 |
23 | # check that each provider has a default model engine
24 | default_engine = provider.default_model_engine()
25 | assert default_engine is not None and len(default_engine) > 0
26 |
27 | # check that for providers that have API keys, they're
28 | # set to a non-empty string
29 | api_key = provider.api_key_env_var()
30 | assert api_key is None or api_key.strip() != ""
31 |
32 | # test that each provider provides both the 'chat' and 'completions'
33 | # endpoints
34 | clients = provider.clients()
35 | assert set(clients.keys()) == {"chat", "completions"}
36 |
37 | # test that the endpoint for the default model engine is valid
38 | endpoint = provider.endpoint_for_model(default_engine)
39 | assert endpoint in clients
40 |
41 | # check that there's at least one model available from the provider
42 | assert len(provider.get_models()) > 0
43 |
44 |
45 | @pytest.mark.parametrize(
46 | "provider",
47 | MODEL_PROVIDERS.values(),
48 | )
49 | def test_model_provider_specific_key_resolution(provider: BaseModelProvider):
50 | """
51 | Tests that the model provider correctly resolves a provider-specific API key
52 | from the environment variables. If it's not required by the provider, checks
53 | that the key is set to None.
54 | """
55 |
56 | api_key_var = provider.api_key_env_var()
57 |
58 | if api_key_var is None:
59 | # ensure that providers that don't require an API key
60 | # resolve None for the key
61 | assert provider.resolve_api_key() is None
62 | else:
63 | # if the provider does require a key, check that the provider-specific
64 | # key is resolved
65 | with mock.patch.dict("os.environ", {api_key_var: "1234"}):
66 | assert provider.resolve_api_key() == "1234"
67 |
68 |
69 | @pytest.mark.parametrize(
70 | "provider",
71 | MODEL_PROVIDERS.values(),
72 | )
73 | @mock.patch.dict("os.environ", {env_vars.PROVIDER_API_KEY: "1234"})
74 | def test_model_provider_generic_key_resolution(provider: BaseModelProvider):
75 | """
76 | Tests that the model provider correctly resolves the generic API key from
77 | the environment variables in the absence of a provider-specific key. If it's
78 | not required by the provider, checks that the key is set to None.
79 | """
80 |
81 | with mock.patch.dict("os.environ"):
82 | # remove the provider-specific key to make sure we're checking generic
83 | # key resolution
84 | if (key := provider.api_key_env_var()) is not None and key in os.environ:
85 | del os.environ[key]
86 |
87 | # check that the generic key is used
88 | if provider.api_key_env_var() is not None:
89 | assert provider.resolve_api_key() == "1234"
90 | else:
91 | assert provider.resolve_api_key() is None
92 |
93 |
94 | @pytest.mark.parametrize(
95 | "provider",
96 | MODEL_PROVIDERS.values(),
97 | )
98 | @mock.patch.dict("os.environ", {env_vars.PROVIDER_API_KEY: "1234"})
99 | def test_model_provider_get_models(provider: BaseModelProvider):
100 | """
101 | Tests that the model provider can correctly retrieve the list of models
102 | from the cache, and that the default language model for each provider
103 | is in that list.
104 | """
105 |
106 | with mock.patch.dict("os.environ"):
107 | # remove the provider-specific key to ensure that a valid key doesn't
108 | # interfere with checking the cache
109 | if (key := provider.api_key_env_var()) is not None and key in os.environ:
110 | del os.environ[key]
111 |
112 | # check that we can find the default model in each provider's list of
113 | # models
114 | default_model = provider.default_model_engine()
115 | assert default_model is not None
116 | assert default_model in provider.get_models()
117 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/08.01.methods.ccc.md:
--------------------------------------------------------------------------------
1 | ## Methods
2 |
3 | The code needed to reproduce all of our analyses and generate the figures is available in [https://github.com/greenelab/ccc](https://github.com/greenelab/ccc).
4 | We provide scripts to download the required data and run all the steps.
5 | A Docker image is provided to use the same runtime environment.
6 |
7 |
8 | ### The CCC algorithm {#sec:ccc_algo .page_break_before}
9 |
10 | The Clustermatch Correlation Coefficient (CCC) computes a similarity value $c \in \left[0,1\right]$ between any pair of numerical or categorical features/variables $\mathbf{x}$ and $\mathbf{y}$ measured on $n$ objects.
11 | CCC assumes that if two features $\mathbf{x}$ and $\mathbf{y}$ are similar, then the partitioning by clustering of the $n$ objects using each feature separately should match.
12 | For example, given $\mathbf{x}=(11, 27, 32, 40)$ and $\mathbf{y}=10x=(110, 270, 320, 400)$, where $n=4$, partitioning each variable into two clusters ($k=2$) using their medians (29.5 for $\mathbf{x}$ and 295 for $\mathbf{y}$) would result in partition $\Omega^{\mathbf{x}}_{k=2}=(1, 1, 2, 2)$ for $\mathbf{x}$, and partition $\Omega^{\mathbf{y}}_{k=2}=(1, 1, 2, 2)$ for $\mathbf{y}$.
13 | Then, the agreement between $\Omega^{\mathbf{x}}_{k=2}$ and $\Omega^{\mathbf{y}}_{k=2}$ can be computed using any measure of similarity between partitions, like the adjusted Rand index (ARI) [@doi:10.1007/BF01908075].
14 | In that case, it will return the maximum value (1.0 in the case of ARI).
15 | Note that the same value of $k$ might not be the right one to find a relationship between any two features.
16 | For instance, in the quadratic example in Figure @fig:datasets_rel, CCC returns a value of 0.36 (grouping objects in four clusters using one feature and two using the other).
17 | If we used only two clusters instead, CCC would return a similarity value of 0.02.
18 | Therefore, the CCC algorithm (shown below) searches for this optimal number of clusters given a maximum $k$, which is its single parameter $k_{\mathrm{max}}$.
19 |
20 | {width="75%"}
22 |
23 | The main function of the algorithm, `ccc`, generates a list of partitionings $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (lines 14 and 15), for each feature $\mathbf{x}$ and $\mathbf{y}$.
24 | Then, it computes the ARI between each partition in $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (line 16), and then it keeps the pair that generates the maximum ARI.
25 | Finally, since ARI does not have a lower bound (it could return negative values, which in our case are not meaningful), CCC returns only values between 0 and 1 (line 17).
26 |
27 |
28 | Interestingly, since CCC only needs a pair of partitions to compute a similarity value, any type of feature that can be used to perform clustering/grouping is supported.
29 | If the feature is numerical (lines 2 to 5 in the `get_partitions` function), then quantiles are used for clustering (for example, the median generates $k=2$ clusters of objects), from $k=2$ to $k=k_{\mathrm{max}}$.
30 | If the feature is categorical (lines 7 to 9), the categories are used to group objects together.
31 | Consequently, since features are internally categorized into clusters, numerical and categorical variables can be naturally integrated since clusters do not need an order.
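
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration for numerical features only, not the optimized implementation from the repository: the function names (`quantile_partition`, `adjusted_rand_index`, `ccc_sketch`) are hypothetical, clusters are formed from ranks (so tied values may be split across clusters), and $k$ is capped at $n$, which is an assumption made here to keep the toy example well defined.

```python
from collections import Counter
from math import comb


def quantile_partition(values, k):
    """Assign each object to one of k rank-based (quantile-like) clusters."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // n  # simplification: ties are split by rank
    return labels


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two partitions of the same objects."""
    n = len(a)
    if n < 2:
        raise ValueError("at least two objects are required")
    # pairs of objects placed together in both partitions, in a, and in b
    together_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    together_a = sum(comb(c, 2) for c in Counter(a).values())
    together_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = together_a * together_b / comb(n, 2)
    max_index = (together_a + together_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together_both - expected) / (max_index - expected)


def ccc_sketch(x, y, k_max=10):
    """Maximum ARI over all pairs of partitions, clipped to [0, 1]."""
    ks = range(2, min(k_max, len(x)) + 1)  # assumption: never more clusters than objects
    parts_x = [quantile_partition(x, k) for k in ks]
    parts_y = [quantile_partition(y, k) for k in ks]
    best = max(adjusted_rand_index(px, py) for px in parts_x for py in parts_y)
    return max(best, 0.0)  # negative ARI values are not meaningful here


x = (11, 27, 32, 40)
y = tuple(10 * v for v in x)
print(quantile_partition(x, 2), quantile_partition(y, 2))
print(ccc_sketch(x, y))
```

For the four-object example given earlier, both features yield the same two-cluster partition (labelled from 0 rather than 1 in this sketch), so the coefficient reaches its maximum.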
32 |
33 |
34 | For all our analyses we used $k_{\mathrm{max}}=10$.
35 | This means that for each gene pair, 18 partitions are generated (9 for each gene, from $k=2$ to $k=10$), and 81 ARI comparisons are performed.
36 | Smaller values of $k_{\mathrm{max}}$ can reduce computation time, although at the expense of missing more complex/general relationships.
37 | Our examples in Figure @fig:datasets_rel suggest that using $k_{\mathrm{max}}=2$ would force CCC to find linear-only patterns, which could be a valid use case when only this kind of relationship is desired.
38 | In addition, $k_{\mathrm{max}}=2$ implies that only two partitions are generated, and only one ARI comparison is performed.
39 | In this regard, our Python implementation of CCC provides flexibility in specifying $k_{\mathrm{max}}$.
40 | For instance, instead of the maximum $k$ (an integer), the parameter could be a custom list of integers: for example, `[2, 5, 10]` will partition the data into two, five and ten clusters.
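
The counts quoted in this paragraph follow from simple arithmetic, sketched below. The helper name `ccc_workload` is hypothetical and is not part of the published package; it only assumes, as the text states, that one partition is generated per value of $k$ for each feature and that every partition of one feature is compared with every partition of the other.

```python
def ccc_workload(k_max):
    """Return (partitions generated, ARI comparisons) for one feature pair.

    `k_max` is either the maximum k (an integer, so k runs from 2 to k_max)
    or a custom list of k values.
    """
    if isinstance(k_max, int):
        ks = list(range(2, k_max + 1))
    else:
        ks = list(k_max)
    per_feature = len(ks)
    # one set of partitions per feature, and all-against-all comparisons
    return 2 * per_feature, per_feature ** 2


print(ccc_workload(10))          # (18, 81)
print(ccc_workload(2))           # (2, 1)
print(ccc_workload([2, 5, 10]))  # (6, 9)
```

The number of comparisons grows quadratically with the number of $k$ values, which is why a smaller $k_{\mathrm{max}}$ or a short custom list reduces computation time.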
41 |
42 |
43 | For a single pair of features (genes in our study), generating partitions or computing their similarity can be parallelized.
44 | We used three CPU cores in our analyses to speed up the computation of CCC.
45 | A future improved implementation of CCC could potentially use graphical processing units (GPU) to parallelize its computation further.
46 |
47 |
48 | A Python implementation of CCC (optimized with `numba` [@doi:10.1145/2833157.2833162]) can be found in our GitHub repository [@url:https://github.com/greenelab/clustermatch-gene-expr], as well as a package published in the Python Package Index (PyPI) that can be easily installed.
49 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc_non_standard_filenames/08.01.meths.md:
--------------------------------------------------------------------------------
1 | ## Methods
2 |
3 | The code needed to reproduce all of our analyses and generate the figures is available in [https://github.com/greenelab/ccc](https://github.com/greenelab/ccc).
4 | We provide scripts to download the required data and run all the steps.
5 | A Docker image is provided to use the same runtime environment.
6 |
7 |
8 | ### The CCC algorithm {#sec:ccc_algo .page_break_before}
9 |
10 | The Clustermatch Correlation Coefficient (CCC) computes a similarity value $c \in \left[0,1\right]$ between any pair of numerical or categorical features/variables $\mathbf{x}$ and $\mathbf{y}$ measured on $n$ objects.
11 | CCC assumes that if two features $\mathbf{x}$ and $\mathbf{y}$ are similar, then the partitioning by clustering of the $n$ objects using each feature separately should match.
12 | For example, given $\mathbf{x}=(11, 27, 32, 40)$ and $\mathbf{y}=10x=(110, 270, 320, 400)$, where $n=4$, partitioning each variable into two clusters ($k=2$) using their medians (29.5 for $\mathbf{x}$ and 295 for $\mathbf{y}$) would result in partition $\Omega^{\mathbf{x}}_{k=2}=(1, 1, 2, 2)$ for $\mathbf{x}$, and partition $\Omega^{\mathbf{y}}_{k=2}=(1, 1, 2, 2)$ for $\mathbf{y}$.
13 | Then, the agreement between $\Omega^{\mathbf{x}}_{k=2}$ and $\Omega^{\mathbf{y}}_{k=2}$ can be computed using any measure of similarity between partitions, like the adjusted Rand index (ARI) [@doi:10.1007/BF01908075].
14 | In that case, it will return the maximum value (1.0 in the case of ARI).
15 | Note that the same value of $k$ might not be the right one to find a relationship between any two features.
16 | For instance, in the quadratic example in Figure @fig:datasets_rel, CCC returns a value of 0.36 (grouping objects in four clusters using one feature and two using the other).
17 | If we used only two clusters instead, CCC would return a similarity value of 0.02.
18 | Therefore, the CCC algorithm (shown below) searches for this optimal number of clusters given a maximum $k$, which is its single parameter $k_{\mathrm{max}}$.
19 |
20 | {width="75%"}
22 |
23 | The main function of the algorithm, `ccc`, generates a list of partitionings $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (lines 14 and 15), for each feature $\mathbf{x}$ and $\mathbf{y}$.
24 | Then, it computes the ARI between each partition in $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (line 16), and then it keeps the pair that generates the maximum ARI.
25 | Finally, since ARI does not have a lower bound (it could return negative values, which in our case are not meaningful), CCC returns only values between 0 and 1 (line 17).
26 |
27 |
28 | Interestingly, since CCC only needs a pair of partitions to compute a similarity value, any type of feature that can be used to perform clustering/grouping is supported.
29 | If the feature is numerical (lines 2 to 5 in the `get_partitions` function), then quantiles are used for clustering (for example, the median generates $k=2$ clusters of objects), from $k=2$ to $k=k_{\mathrm{max}}$.
30 | If the feature is categorical (lines 7 to 9), the categories are used to group objects together.
31 | Consequently, since features are internally categorized into clusters, numerical and categorical variables can be naturally integrated since clusters do not need an order.
32 |
33 |
34 | For all our analyses we used $k_{\mathrm{max}}=10$.
35 | This means that for each gene pair, 18 partitions are generated (9 for each gene, from $k=2$ to $k=10$), and 81 ARI comparisons are performed.
36 | Smaller values of $k_{\mathrm{max}}$ can reduce computation time, although at the expense of missing more complex/general relationships.
37 | Our examples in Figure @fig:datasets_rel suggest that using $k_{\mathrm{max}}=2$ would force CCC to find linear-only patterns, which could be a valid use case when only this kind of relationship is desired.
38 | In addition, $k_{\mathrm{max}}=2$ implies that only two partitions are generated, and only one ARI comparison is performed.
39 | In this regard, our Python implementation of CCC provides flexibility in specifying $k_{\mathrm{max}}$.
40 | For instance, instead of the maximum $k$ (an integer), the parameter could be a custom list of integers: for example, `[2, 5, 10]` will partition the data into two, five and ten clusters.
41 |
42 |
43 | For a single pair of features (genes in our study), generating partitions or computing their similarity can be parallelized.
44 | We used three CPU cores in our analyses to speed up the computation of CCC.
45 | A future improved implementation of CCC could potentially use graphical processing units (GPU) to parallelize its computation further.
46 |
47 |
48 | A Python implementation of CCC (optimized with `numba` [@doi:10.1145/2833157.2833162]) can be found in our GitHub repository [@url:https://github.com/greenelab/clustermatch-gene-expr], as well as a package published in the Python Package Index (PyPI) that can be easily installed.
49 |
--------------------------------------------------------------------------------
/libs/manubot_ai_editor/env_vars.py:
--------------------------------------------------------------------------------
1 | """
2 | This file contains the environment variable names used by the manubot-ai-editor
3 | package. They allow specifying different parameters when calling the
4 | OpenAI model, such as the language model or the maximum tokens per request
5 | (see more details in https://beta.openai.com/docs/api-reference/completions/create).
6 |
7 | If you are using our GitHub Actions workflow provided by manubot/rootstock, you need
8 | to modify the "Revise manuscript" step in the workflow file (.github/workflows/ai-revision.yaml)
9 | by adding the environment variable name specified in the _value_ of the variables. For instance,
10 | if you want to provide a custom prompt, then you need to add a line like this to the workflow:
11 |
12 | AI_EDITOR_CUSTOM_PROMPT="proofread the following paragraph"
13 | """
14 |
15 | # generic provider API key to use when a provider-specific API key is not
16 | # provided
17 | PROVIDER_API_KEY = "PROVIDER_API_KEY"
18 |
19 | # provider-specific API key overrides
20 | OPENAI_API_KEY = "OPENAI_API_KEY"
21 | ANTHROPIC_API_KEY = "ANTHROPIC_API_KEY"
22 |
23 | # model provider to use, e.g. "openai" or "anthropic"
24 | MODEL_PROVIDER = "AI_EDITOR_MODEL_PROVIDER"
25 |
26 | # Language model to use. For example, "text-davinci-003", "gpt-3.5-turbo", "gpt-3.5-turbo-0301", etc
27 | # The tool currently supports the "chat/completions" and "completions" endpoints, and you can check
28 | # compatible openai models here: https://platform.openai.com/docs/models/model-endpoint-compatibility
29 | # anthropic models are here: https://docs.anthropic.com/en/docs/about-claude/models
30 | LANGUAGE_MODEL = "AI_EDITOR_LANGUAGE_MODEL"
31 |
32 | # Model parameter: max_tokens
33 | MAX_TOKENS_PER_REQUEST = "AI_EDITOR_MAX_TOKENS_PER_REQUEST"
34 |
35 | # Model parameter: temperature
36 | TEMPERATURE = "AI_EDITOR_TEMPERATURE"
37 |
38 | # Model parameter: top_p
39 | TOP_P = "AI_EDITOR_TOP_P"
40 |
41 | # Model parameter: presence_penalty
42 | PRESENCE_PENALTY = "AI_EDITOR_PRESENCE_PENALTY"
43 |
44 | # Model parameter: frequency_penalty
45 | FREQUENCY_PENALTY = "AI_EDITOR_FREQUENCY_PENALTY"
46 |
47 | # Model parameter: best_of
48 | BEST_OF = "AI_EDITOR_BEST_OF"
49 |
50 | # Allows specifying a JSON string, where keys are filenames and values are
51 | # section names. For example: '{"01.intro.md": "introduction"}'
52 | # Possible values for section names are: "abstract", "introduction",
53 | # "results", "discussion", "conclusions", "methods", and "supplementary material".
54 | # Take a look at function 'get_prompt' in 'libs/manubot_ai_editor/models.py'
55 | # to see which prompts are used for each section.
56 | # Although the AI Editor tries to infer the section name from the filename,
57 | # sometimes filenames are not descriptive enough (e.g., "01.intro.md" or
58 | # "02.review.md" might indicate an introduction).
59 | # Mapping filenames to section names is useful to provide more context to the
60 | # AI model when revising a paragraph. For example, for the introduction, prompts
61 | # contain sentences to preserve most of the citations to other papers.
62 | SECTIONS_MAPPING = "AI_EDITOR_FILENAME_SECTION_MAPPING"
63 |
64 | # Sometimes the AI model returns an empty paragraph. Usually, this is resolved
65 | # by running the model again. The AI Editor will try five (5) times in these
66 | # cases. This variable allows specifying the number of retries.
67 | RETRY_COUNT = "AI_EDITOR_RETRY_COUNT"
68 |
69 | # If specified, only these file names will be revised. Multiple files can be
70 | # specified, separated by commas. For example: "01.intro.md,02.review.md"
71 | FILENAMES_TO_REVISE = "AI_EDITOR_FILENAMES_TO_REVISE"
72 |
73 | # Allows specifying a single, custom prompt for all sections. For example:
74 | # "proofread and revise the following paragraph"; in this case, the tool will automatically
75 | # append the characters ':\n\n' followed by the paragraph.
76 | # It is also possible to include placeholders in the prompt, which will be replaced
77 | # by the corresponding values. For example, "proofread and revise the following
78 | # paragraph from the section {section_name} of a scientific manuscript with title '{title}'".
79 | # The complete list of placeholders is: {paragraph_text}, {section_name},
80 | # {title}, {keywords}.
81 | CUSTOM_PROMPT = "AI_EDITOR_CUSTOM_PROMPT"
82 |
83 | # Specifies the source and destination encodings of input and output markdown
84 | # files. Behavior is as follows:
85 | # - If neither SRC_ENCODING nor DEST_ENCODING are specified, the tool will
86 | # attempt to identify the encoding using the charset_normalizer library and
87 | # use that encoding to both read and write the output files.
88 | # - If only SRC_ENCODING is specified, it will be used to both read and write
89 | # the files.
90 | # - If only DEST_ENCODING is specified, it will be used to write the output
91 | # files, and the input files will be read using the encoding identified by
92 | # charset_normalizer.
93 | SRC_ENCODING = "AI_EDITOR_SRC_ENCODING"
94 | DEST_ENCODING = "AI_EDITOR_DEST_ENCODING"
95 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/04.05.results_intro.md:
--------------------------------------------------------------------------------
1 | ### A robust and efficient not-only-linear dependence coefficient
2 |
3 | {#fig:datasets_rel width="100%"}
11 |
12 | The CCC provides a similarity measure between any pair of variables, either with numerical or categorical values.
13 | The method assumes that if there is a relationship between two variables/features describing $n$ data points/objects, then the **cluster**ings of those objects using each variable should **match**.
14 | In the case of numerical values, CCC uses quantiles to efficiently separate data points into different clusters (e.g., the median separates numerical data into two clusters).
15 | Once all clusterings are generated according to each variable, we define the CCC as the maximum adjusted Rand index (ARI) [@doi:10.1007/BF01908075] between them, ranging between 0 and 1.
16 | Details of the CCC algorithm can be found in [Methods](#sec:ccc_algo).
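The definition above (the maximum ARI across all pairs of clusterings) can be sketched in plain Python. This is a minimal illustration, not the authors' optimized implementation: the function names are hypothetical, and clipping negative ARI values to zero is an assumption made here so that the result stays in the stated range of 0 to 1.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same objects."""
    n = len(labels_a)
    # Count pairs of objects that share a cluster in both, in a, and in b.
    sum_ab = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g., both labelings have a single cluster).
        return 1.0
    return (sum_ab - expected) / (max_index - expected)


def max_ari(partitions_x, partitions_y):
    """Maximum ARI over every pair of partitions, clipped at zero (assumption)."""
    best = max(
        adjusted_rand_index(px, py)
        for px in partitions_x
        for py in partitions_y
    )
    return max(best, 0.0)
```

For example, `max_ari([[0, 0, 1, 1]], [[0, 1, 0, 1], [1, 1, 0, 0]])` considers two candidate partitions of the second variable and keeps the one that matches the first variable best.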
17 |
18 |
19 | We examined how the Pearson ($p$), Spearman ($s$) and CCC ($c$) correlation coefficients behaved on different simulated data patterns.
20 | In the first row of Figure @fig:datasets_rel, we examine the classic Anscombe's quartet [@doi:10.1080/00031305.1973.10478966], which comprises four synthetic datasets with different patterns but the same data statistics (mean, standard deviation and Pearson's correlation).
21 | This kind of simulated data, recently revisited with the "Datasaurus" [@url:http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html; @doi:10.1145/3025453.3025912; @doi:10.1111/dsji.12233], is used as a reminder of the importance of going beyond simple statistics, where either undesirable patterns (such as outliers) or desirable ones (such as biologically meaningful nonlinear relationships) can be masked by summary statistics alone.
22 |
23 |
24 | Anscombe I contains a noisy but clear linear pattern, similar to Anscombe III, where the linearity is perfect except for one outlier.
25 | In these two examples, CCC separates data points using two clusters (one red line for each variable $x$ and $y$), yielding 1.0 and thus indicating a strong relationship.
26 | Anscombe II seems to follow a partially quadratic relationship interpreted as linear by Pearson and Spearman.
27 | In contrast, for this potentially undersampled quadratic pattern, CCC yields a lower yet non-zero value of 0.34, reflecting a more complex relationship than a linear pattern.
28 | Anscombe IV shows a vertical line of data points where $x$ values are almost constant except for one outlier.
29 | This outlier does not influence CCC as it does for Pearson or Spearman.
30 | Thus $c=0.00$ (the minimum value) correctly indicates no association for this variable pair because, besides the outlier, for a single value of $x$ there are ten different values for $y$.
31 | This pair of variables does not fit the CCC assumption: the two clusters formed with $x$ (approximately separated by $x=13$) do not match the three clusters formed with $y$.
32 | Pearson's correlation coefficient is the same across all these Anscombe examples ($p=0.82$), whereas Spearman is 0.50 or greater.
33 | These simulated datasets show that both Pearson and Spearman are powerful in detecting linear patterns.
34 | However, any deviation from this assumption (like nonlinear relationships or outliers) affects their robustness.
35 |
36 |
37 | We simulated additional types of relationships (Figure @fig:datasets_rel, second row), including some previously described from gene expression data [@doi:10.1126/science.1205438; @doi:10.3389/fgene.2019.01410; @doi:10.1091/mbc.9.12.3273].
38 | For the random/independent pair of variables, all coefficients correctly agree with a value close to zero.
39 | The non-coexistence pattern, captured by all coefficients, represents a case where one gene ($x$) might be expressed while the other one ($y$) is inhibited, highlighting a potentially strong biological relationship (such as a microRNA negatively regulating another gene).
40 | For the other two examples (quadratic and two-lines), Pearson and Spearman do not capture the nonlinear pattern between variables $x$ and $y$.
41 | These patterns also show how CCC uses different degrees of complexity to capture the relationships.
42 | For the quadratic pattern, for example, CCC separates $x$ into more clusters (four in this case) to reach the maximum ARI.
43 | The two-lines example shows two embedded linear relationships with different slopes, which neither Pearson nor Spearman detect ($p=-0.12$ and $s=0.05$, respectively).
44 | Here, CCC increases the complexity of the model by using eight clusters for $x$ and six for $y$, resulting in $c=0.31$.
45 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc_non_standard_filenames/04.05.res.md:
--------------------------------------------------------------------------------
1 | ### A robust and efficient not-only-linear dependence coefficient
2 |
3 | {#fig:datasets_rel width="100%"}
11 |
12 | The CCC provides a similarity measure between any pair of variables, either with numerical or categorical values.
13 | The method assumes that if there is a relationship between two variables/features describing $n$ data points/objects, then the **cluster**ings of those objects using each variable should **match**.
14 | In the case of numerical values, CCC uses quantiles to efficiently separate data points into different clusters (e.g., the median separates numerical data into two clusters).
15 | Once all clusterings are generated according to each variable, we define the CCC as the maximum adjusted Rand index (ARI) [@doi:10.1007/BF01908075] between them, ranging between 0 and 1.
16 | Details of the CCC algorithm can be found in [Methods](#sec:ccc_algo).
17 |
18 |
19 | We examined how the Pearson ($p$), Spearman ($s$) and CCC ($c$) correlation coefficients behaved on different simulated data patterns.
20 | In the first row of Figure @fig:datasets_rel, we examine the classic Anscombe's quartet [@doi:10.1080/00031305.1973.10478966], which comprises four synthetic datasets with different patterns but the same data statistics (mean, standard deviation and Pearson's correlation).
21 | This kind of simulated data, recently revisited with the "Datasaurus" [@url:http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html; @doi:10.1145/3025453.3025912; @doi:10.1111/dsji.12233], is used as a reminder of the importance of going beyond simple statistics, where either undesirable patterns (such as outliers) or desirable ones (such as biologically meaningful nonlinear relationships) can be masked by summary statistics alone.
22 |
23 |
24 | Anscombe I contains a noisy but clear linear pattern, similar to Anscombe III, where the linearity is perfect except for one outlier.
25 | In these two examples, CCC separates data points using two clusters (one red line for each variable $x$ and $y$), yielding 1.0 and thus indicating a strong relationship.
26 | Anscombe II seems to follow a partially quadratic relationship interpreted as linear by Pearson and Spearman.
27 | In contrast, for this potentially undersampled quadratic pattern, CCC yields a lower yet non-zero value of 0.34, reflecting a more complex relationship than a linear pattern.
28 | Anscombe IV shows a vertical line of data points where $x$ values are almost constant except for one outlier.
29 | This outlier does not influence CCC as it does for Pearson or Spearman.
30 | Thus $c=0.00$ (the minimum value) correctly indicates no association for this variable pair because, besides the outlier, for a single value of $x$ there are ten different values for $y$.
31 | This pair of variables does not fit the CCC assumption: the two clusters formed with $x$ (approximately separated by $x=13$) do not match the three clusters formed with $y$.
32 | Pearson's correlation coefficient is the same across all these Anscombe examples ($p=0.82$), whereas Spearman is 0.50 or greater.
33 | These simulated datasets show that both Pearson and Spearman are powerful in detecting linear patterns.
34 | However, any deviation from this assumption (like nonlinear relationships or outliers) affects their robustness.
35 |
36 |
37 | We simulated additional types of relationships (Figure @fig:datasets_rel, second row), including some previously described from gene expression data [@doi:10.1126/science.1205438; @doi:10.3389/fgene.2019.01410; @doi:10.1091/mbc.9.12.3273].
38 | For the random/independent pair of variables, all coefficients correctly agree with a value close to zero.
39 | The non-coexistence pattern, captured by all coefficients, represents a case where one gene ($x$) might be expressed while the other one ($y$) is inhibited, highlighting a potentially strong biological relationship (such as a microRNA negatively regulating another gene).
40 | For the other two examples (quadratic and two-lines), Pearson and Spearman do not capture the nonlinear pattern between variables $x$ and $y$.
41 | These patterns also show how CCC uses different degrees of complexity to capture the relationships.
42 | For the quadratic pattern, for example, CCC separates $x$ into more clusters (four in this case) to reach the maximum ARI.
43 | The two-lines example shows two embedded linear relationships with different slopes, which neither Pearson nor Spearman detect ($p=-0.12$ and $s=0.05$, respectively).
44 | Here, CCC increases the complexity of the model by using eight clusters for $x$ and six for $y$, resulting in $c=0.31$.
45 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier/metadata.yaml:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Projecting genetic associations through gene expression patterns highlights disease etiology and drug mechanisms"
3 | keywords:
4 | - genetic studies
5 | - functional genomics
6 | - gene co-expression
7 | - therapeutic targets
8 | - drug repurposing
9 | - clustering of complex traits
10 | lang: en-US
11 | authors:
12 | - name: Milton Pividori
13 | github: miltondp
14 | initials: MP
15 | orcid: 0000-0002-3035-4403
16 | twitter: miltondp
17 | email: miltondp@gmail.com
18 | affiliations:
19 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
20 | funders:
21 | - The Gordon and Betty Moore Foundation (GBMF 4552)
22 | - The National Human Genome Research Institute (R01 HG010067)
23 | - The National Human Genome Research Institute (K99HG011898)
24 |
25 | - name: Sumei Lu
26 | affiliations:
27 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
28 |
29 | - name: Binglan Li
30 | orcid: 0000-0002-0103-6107
31 | affiliations:
32 | - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
33 |
34 | - name: Chun Su
35 | orcid: 0000-0001-6388-8666
36 | github: sckinta
37 | affiliations:
38 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
39 |
40 | - name: Matthew E. Johnson
41 | affiliations:
42 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
43 |
44 | - name: Wei-Qi Wei
45 | affiliations:
46 | - Vanderbilt University Medical Center
47 |
48 | - name: Qiping Feng
49 | orcid: 0000-0002-6213-793X
50 | affiliations:
51 | - Vanderbilt University Medical Center
52 |
53 | - name: Bahram Namjou
54 | affiliations:
55 | - Cincinnati Children's Hospital Medical Center
56 |
57 | - name: Krzysztof Kiryluk
58 | orcid: 0000-0002-5047-6715
59 | twitter: kirylukk
60 | affiliations:
61 | - Department of Medicine, Division of Nephrology, Vagelos College of Physicians & Surgeons, Columbia University, New York, New York
62 |
63 | - name: Iftikhar Kullo
64 | affiliations:
65 | - Mayo Clinic
66 |
67 | - name: Yuan Luo
68 | affiliations:
69 | - Northwestern University
70 |
71 | - name: Blair D. Sullivan
72 | affiliations:
73 | - School of Computing, University of Utah, Salt Lake City, UT, USA
74 |
75 | - name: Benjamin F. Voight
76 | orcid: 0000-0002-6205-9994
77 | twitter: bvoight28
78 | github: bvoight
79 | affiliations:
80 | - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
81 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
82 | - Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
83 |
84 | - name: Carsten Skarke
85 | orcid: 0000-0001-5145-3681
86 | twitter: CarstenSkarke
87 | affiliations:
88 | - Institute for Translational Medicine and Therapeutics, Department of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
89 |
90 | - name: Marylyn D. Ritchie
91 | initials: MDR
92 | orcid: 0000-0002-1208-1720
93 | twitter: MarylynRitchie
94 | email: marylyn@pennmedicine.upenn.edu
95 | affiliations:
96 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
97 |
98 | - name: Struan F.A. Grant
99 | email: grants@chop.edu
100 | orcid: 0000-0003-2025-5302
101 | twitter: STRUANGRANT
102 | affiliations:
103 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
104 | - Division of Endocrinology and Diabetes, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
105 | - Division of Human Genetics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
106 | - Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
107 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
108 |
109 | - name: Casey S. Greene
110 | github: cgreene
111 | initials: CSG
112 | orcid: 0000-0001-8713-9213
113 | twitter: GreeneScientist
114 | email: casey.s.greene@cuanschutz.edu
115 | affiliations:
116 | - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA
117 | - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA
118 | funders:
119 | - The Gordon and Betty Moore Foundation (GBMF 4552)
120 | - The National Human Genome Research Institute (R01 HG010067)
121 | - The National Cancer Institute (R01 CA237170)
122 | corresponding: true
123 |
--------------------------------------------------------------------------------
/docs/env-vars.md:
--------------------------------------------------------------------------------
1 | # Manubot AI Editor Environment Variables
2 |
3 | Manubot AI Editor provides a variety of options to customize the revision
4 | process. These options are exposed as environment variables, most of which are
5 | prefixed with `AI_EDITOR_`; the API key variables described below are the exception.
6 |
7 | The following environment variables are supported, organized into categories:
8 |
9 | ## Provider Configuration
10 |
11 | This tool refers to services that provide LLMs, such as OpenAI or Anthropic, as
12 | "providers".
13 |
14 | - `AI_EDITOR_MODEL_PROVIDER`: Specifies the provider; currently, the values we
15 | support are "openai" for OpenAI and "anthropic" for Anthropic.
16 |
17 | ## Provider API Key Configuration
18 |
19 | For providers that require API keys, you can specify an API key specific to that provider via an environment variable named `<PROVIDER>_API_KEY`, where `<PROVIDER>` is the provider name in uppercase.
20 | For example, for OpenAI, the API key variable would be named `OPENAI_API_KEY` and for Anthropic, it would be `ANTHROPIC_API_KEY`.
21 |
22 | Alternatively, you can use the environment variable `PROVIDER_API_KEY` to set an API key that will be used for all providers.
23 | If both a provider-specific key and `PROVIDER_API_KEY` are set, the provider-specific key will take precedence.
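The precedence rule above can be illustrated with a minimal sketch. The function name `resolve_api_key` and the optional `env` argument are hypothetical and exist only for this example; the tool's actual lookup code may be organized differently.

```python
import os


def resolve_api_key(provider, env=None):
    """Return the API key for a provider, preferring the provider-specific one."""
    env = os.environ if env is None else env
    # e.g. "openai" -> "OPENAI_API_KEY", "anthropic" -> "ANTHROPIC_API_KEY"
    specific = env.get(f"{provider.upper()}_API_KEY")
    if specific:
        return specific
    # Fall back to the generic key shared by all providers (may be None).
    return env.get("PROVIDER_API_KEY")
```

For instance, with both `OPENAI_API_KEY` and `PROVIDER_API_KEY` set, `resolve_api_key("openai")` returns the former, while `resolve_api_key("anthropic")` falls back to the latter.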
24 |
25 | ## Model Configuration
26 |
27 | - `AI_EDITOR_LANGUAGE_MODEL`: Language model to use. For example,
28 | "text-davinci-003", "gpt-3.5-turbo", "gpt-3.5-turbo-0301", etc. The tool
29 | currently supports the "chat/completions", "completions", and "edits" endpoints,
30 | and you can check compatible models here:
31 | https://platform.openai.com/docs/models/model-endpoint-compatibility
32 | Anthropic models are listed here:
33 | https://docs.anthropic.com/en/docs/about-claude/models/all-models
34 | - `AI_EDITOR_MAX_TOKENS_PER_REQUEST`: Model parameter: `max_tokens`
35 | - `AI_EDITOR_TEMPERATURE`: Model parameter: `temperature`
36 | - `AI_EDITOR_TOP_P`: Model parameter: `top_p`
37 | - `AI_EDITOR_PRESENCE_PENALTY`: Model parameter: `presence_penalty`
38 | - `AI_EDITOR_FREQUENCY_PENALTY`: Model parameter: `frequency_penalty`
39 | - `AI_EDITOR_BEST_OF`: Model parameter: `best_of`
40 |
41 | ## Prompt and Query Control
42 |
43 | - `AI_EDITOR_FILENAME_SECTION_MAPPING`: Allows the user to specify a JSON
44 | string, where keys are filenames and values are section names. For example:
45 | `{"01.intro.md": "introduction"}`. Possible values for section names are:
46 | "abstract", "introduction", "results", "discussion", "conclusions", "methods",
47 | and "supplementary material". Take a look at function `get_prompt()` in
48 | [libs/manubot_ai_editor/models.py](https://github.com/manubot/manubot-ai-editor/blob/main/libs/manubot_ai_editor/models.py#L256)
49 | to see which prompts are used for each section. Although the AI Editor tries to
50 | infer the section name from the filename, sometimes filenames are not
51 | descriptive enough (e.g., "01.intro.md" or "02.review.md" might indicate an
52 | introduction). Mapping filenames to section names is useful to provide more
53 | context to the AI model when revising a paragraph. For example, for the
54 | introduction, prompts contain sentences to preserve most of the citations to
55 | other papers.
56 | - `AI_EDITOR_RETRY_COUNT`: Sometimes the AI model returns an empty paragraph.
57 | Usually, this is resolved by running the model again. By default, the AI Editor
58 | retries five times in these cases. This variable overrides that default number
59 | of retries.
60 | - `AI_EDITOR_FILENAMES_TO_REVISE`: If specified, only these file names will be
61 | revised. Multiple files can be specified, separated by commas. For example:
62 | "01.intro.md,02.review.md"
63 | - `AI_EDITOR_CUSTOM_PROMPT`: Allows the user to specify a single, custom prompt
64 | for all sections. For example: "proofread and revise the following paragraph";
65 | in this case, the tool will automatically append the characters ':\n\n' followed
66 | by the paragraph. It is also possible to include placeholders in the prompt,
67 | which will be replaced by the corresponding values. For example, "proofread and
68 | revise the following paragraph from the section {section_name} of a scientific
69 | manuscript with title '{title}'". The complete list of placeholders is:
70 | `{paragraph_text}`, `{section_name}`, `{title}`, `{keywords}`.
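As an illustration of how a custom prompt with placeholders might be expanded, here is a minimal sketch. The function `build_prompt` is hypothetical, and two details are assumptions of this example rather than documented behavior: that `':\n\n'` plus the paragraph is appended exactly when `{paragraph_text}` is absent from the prompt, and that keywords are joined with commas.

```python
import re


def build_prompt(custom_prompt, paragraph_text, section_name, title, keywords):
    """Expand placeholders in a custom prompt (illustrative sketch only)."""
    values = {
        "{paragraph_text}": paragraph_text,
        "{section_name}": section_name,
        "{title}": title,
        "{keywords}": ", ".join(keywords),  # assumed joining convention
    }
    # Substitute in a single pass so that replaced text is never re-scanned.
    pattern = re.compile("|".join(re.escape(key) for key in values))
    prompt = pattern.sub(lambda match: values[match.group(0)], custom_prompt)
    if "{paragraph_text}" not in custom_prompt:
        # Assumption: without the placeholder, the paragraph is appended.
        prompt = f"{prompt}:\n\n{paragraph_text}"
    return prompt
```

Under these assumptions, a prompt such as "proofread and revise the following paragraph" ends with the paragraph itself, whereas a prompt that contains `{paragraph_text}` controls where the paragraph appears.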
71 |
72 | ## Encodings
73 |
74 | These variables specify the source and destination encodings of input and output markdown
75 | files. Behavior is as follows:
76 | - If neither `SRC_ENCODING` nor `DEST_ENCODING` is specified, both the input
77 | and output encodings will default to `utf-8`.
78 | - If only `SRC_ENCODING` is specified, it will be used to both read and write
79 | the files. If the special value `_auto_` is used, the tool will attempt to
80 | identify the encoding using the
81 | [charset_normalizer](https://github.com/jawah/charset_normalizer) library,
82 | then use that encoding to both read the input files and write the output
83 | files.
84 | - If only `DEST_ENCODING` is specified, it will be used to write the output
85 | files; the input encoding will be assumed to be `utf-8`.
86 |
87 | The variables:
88 |
89 | - `AI_EDITOR_SRC_ENCODING`: the encoding of the input markdown files
90 | - if empty, defaults to `utf-8`, and
91 | - if `_auto_`, the input encoding is auto-detected.
92 | - `AI_EDITOR_DEST_ENCODING`: the encoding to use when writing the output markdown
93 | files
94 | - if empty, defaults to whatever was used for the source encoding.
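The rules above can be summarized in a short sketch. The function `resolve_encodings` and its `detect` callback are hypothetical names used only here; in the tool, detection is described as being handled by charset_normalizer, which this example deliberately leaves behind a callback so that it stays dependency-free.

```python
import os


def resolve_encodings(env=None, detect=None):
    """Return (source, destination) encodings following the documented rules."""
    env = os.environ if env is None else env
    # Empty or unset source encoding defaults to utf-8.
    src = env.get("AI_EDITOR_SRC_ENCODING") or "utf-8"
    if src == "_auto_":
        if detect is None:
            raise ValueError("a detection callback is required for _auto_")
        # Hypothetical callback that returns the detected encoding name.
        src = detect()
    # Empty or unset destination encoding reuses the source encoding.
    dest = env.get("AI_EDITOR_DEST_ENCODING") or src
    return src, dest
```

For example, setting only `AI_EDITOR_DEST_ENCODING` to `latin-1` gives `("utf-8", "latin-1")`, and setting only `AI_EDITOR_SRC_ENCODING` to `gbk` gives `("gbk", "gbk")`.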
95 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/02.introduction.md:
--------------------------------------------------------------------------------
1 | ## Introduction
2 |
3 | New technologies have vastly improved data collection, generating a deluge of information across different disciplines.
4 | This large amount of data provides new opportunities to address unanswered scientific questions, provided we have efficient tools capable of identifying multiple types of underlying patterns.
5 | Correlation analysis is an essential statistical technique for discovering relationships between variables [@pmid:21310971].
6 | Correlation coefficients are often used in exploratory data mining techniques, such as clustering or community detection algorithms, to compute a similarity value between a pair of objects of interest such as genes [@pmid:27479844] or disease-relevant lifestyle factors [@doi:10.1073/pnas.1217269109].
7 | Correlation methods are also used in supervised tasks, for example, for feature selection to improve prediction accuracy [@pmid:27006077; @pmid:33729976].
8 | The Pearson correlation coefficient is ubiquitously deployed across application domains and diverse scientific areas.
9 | Thus, even minor but significant improvements in these techniques could have enormous consequences in industry and research.
10 |
11 |
12 | In transcriptomics, many analyses start with estimating the correlation between genes.
13 | More sophisticated approaches built on correlation analysis can suggest gene function [@pmid:21241896], aid in discovering common and cell lineage-specific regulatory networks [@pmid:25915600], and capture important interactions in a living organism that can uncover molecular mechanisms in other species [@pmid:21606319; @pmid:16968540].
14 | The analysis of large RNA-seq datasets [@pmid:32913098; @pmid:34844637] can also reveal complex transcriptional mechanisms underlying human diseases [@pmid:27479844; @pmid:31121115; @pmid:30668570; @pmid:32424349; @pmid:34475573].
15 | Since the introduction of the omnigenic model of complex traits [@pmid:28622505; @pmid:31051098], gene-gene relationships are playing an increasingly important role in genetic studies of human diseases [@pmid:34845454; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342; @doi:10.1038/s41588-021-00913-z], even in specific fields such as polygenic risk scores [@doi:10.1016/j.ajhg.2021.07.003].
16 | In this context, recent approaches combine disease-associated genes from genome-wide association studies (GWAS) with gene co-expression networks to prioritize "core" genes directly affecting diseases [@doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342].
17 | These core genes are not captured by standard statistical methods but are believed to be part of highly-interconnected, disease-relevant regulatory networks.
18 | Therefore, advanced correlation coefficients could immediately find wide applications across many areas of biology, including the prioritization of candidate drug targets in the precision medicine field.
19 |
20 |
21 | The Pearson and Spearman correlation coefficients are widely used because they reveal intuitive relationships and can be computed quickly.
22 | However, they are designed to capture linear or monotonic patterns (referred to as linear-only) and may miss complex yet critical relationships.
23 | Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505].
24 | MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077].
25 | However, the computational complexity makes them impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001].
26 | Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions [@pmid:33972855].
27 | We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899].
28 | Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear coefficient that works across quantitative and qualitative variables.
29 | CCC has a single parameter that limits the maximum complexity of relationships found (from linear to more general patterns) and computation time.
30 | CCC provides a high level of flexibility to detect specific types of patterns that are more important for the user, while providing safe defaults to capture general relationships.
31 | We also provide an efficient CCC implementation that is highly parallelizable, which speeds up computation across variable pairs with millions of objects or conditions.
32 | To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues [@doi:10.1126/science.aaz1776].
33 | CCC captured both strong linear relationships and novel nonlinear patterns, which were entirely missed by standard coefficients.
34 | For example, some of these nonlinear patterns were associated with sex differences in gene expression, suggesting that CCC can capture strong relationships present only in a subset of samples.
35 | We also found that the CCC behaves similarly to MIC in several cases, although it is much faster to compute.
36 | Gene pairs detected in expression data by CCC had higher interaction probabilities in tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@doi:10.1038/ng.3259].
37 | Furthermore, its ability to efficiently handle diverse data types (including numerical and categorical features) reduces preprocessing steps and makes it appealing for analyzing large and heterogeneous repositories.
38 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc_non_standard_filenames/02.beginning.md:
--------------------------------------------------------------------------------
1 | ## Introduction
2 |
3 | New technologies have vastly improved data collection, generating a deluge of information across different disciplines.
4 | This large amount of data provides new opportunities to address unanswered scientific questions, provided we have efficient tools capable of identifying multiple types of underlying patterns.
5 | Correlation analysis is an essential statistical technique for discovering relationships between variables [@pmid:21310971].
6 | Correlation coefficients are often used in exploratory data mining techniques, such as clustering or community detection algorithms, to compute a similarity value between a pair of objects of interest such as genes [@pmid:27479844] or disease-relevant lifestyle factors [@doi:10.1073/pnas.1217269109].
7 | Correlation methods are also used in supervised tasks, for example, for feature selection to improve prediction accuracy [@pmid:27006077; @pmid:33729976].
8 | The Pearson correlation coefficient is ubiquitously deployed across application domains and diverse scientific areas.
9 | Thus, even minor improvements in these techniques could have enormous consequences in industry and research.
10 |
11 |
12 | In transcriptomics, many analyses start with estimating the correlation between genes.
13 | More sophisticated approaches built on correlation analysis can suggest gene function [@pmid:21241896], aid in discovering common and cell lineage-specific regulatory networks [@pmid:25915600], and capture important interactions in a living organism that can uncover molecular mechanisms in other species [@pmid:21606319; @pmid:16968540].
14 | The analysis of large RNA-seq datasets [@pmid:32913098; @pmid:34844637] can also reveal complex transcriptional mechanisms underlying human diseases [@pmid:27479844; @pmid:31121115; @pmid:30668570; @pmid:32424349; @pmid:34475573].
15 | Since the introduction of the omnigenic model of complex traits [@pmid:28622505; @pmid:31051098], gene-gene relationships are playing an increasingly important role in genetic studies of human diseases [@pmid:34845454; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342; @doi:10.1038/s41588-021-00913-z], even in specific fields such as polygenic risk scores [@doi:10.1016/j.ajhg.2021.07.003].
16 | In this context, recent approaches combine disease-associated genes from genome-wide association studies (GWAS) with gene co-expression networks to prioritize "core" genes directly affecting diseases [@doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342].
17 | These core genes are not captured by standard statistical methods but are believed to be part of highly-interconnected, disease-relevant regulatory networks.
18 | Therefore, advanced correlation coefficients could immediately find wide applications across many areas of biology, including the prioritization of candidate drug targets in the precision medicine field.
19 |
20 |
21 | The Pearson and Spearman correlation coefficients are widely used because they reveal intuitive relationships and can be computed quickly.
22 | However, they are designed to capture linear or monotonic patterns (referred to as linear-only) and may miss complex yet critical relationships.
23 | Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505].
24 | MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077].
25 | However, their computational complexity makes them impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001].
26 | Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions [@pmid:33972855].
27 | We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899].
28 | Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear coefficient that works across quantitative and qualitative variables.
29 | CCC has a single parameter that limits the maximum complexity of relationships found (from linear to more general patterns) and computation time.
30 | CCC provides a high level of flexibility to detect specific types of patterns that are more important for the user, while providing safe defaults to capture general relationships.
31 | We also provide an efficient CCC implementation that is highly parallelizable, which speeds up computation across variable pairs with millions of objects or conditions.
32 | To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues [@doi:10.1126/science.aaz1776].
33 | CCC captured both strong linear relationships and novel nonlinear patterns, which were entirely missed by standard coefficients.
34 | For example, some of these nonlinear patterns were associated with sex differences in gene expression, suggesting that CCC can capture strong relationships present only in a subset of samples.
35 | We also found that the CCC behaves similarly to MIC in several cases, although it is much faster to compute.
36 | Gene pairs detected in expression data by CCC had higher interaction probabilities in tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@doi:10.1038/ng.3259].
37 | Furthermore, its ability to efficiently handle diverse data types (including numerical and categorical features) reduces preprocessing steps and makes it appealing for analyzing large and heterogeneous repositories.
38 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full/content/metadata.yaml:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Projecting genetic associations through gene expression patterns highlights disease etiology and drug mechanisms"
3 | date: 2023-09-09 # Defaults to date generated, but can specify like '2022-10-31'.
4 | keywords:
5 | - genetic studies
6 | - functional genomics
7 | - gene co-expression
8 | - therapeutic targets
9 | - drug repurposing
10 | - clustering of complex traits
11 | lang: en-US
12 | authors:
13 | - name: Milton Pividori
14 | github: miltondp
15 | initials: MP
16 | orcid: 0000-0002-3035-4403
17 | twitter: miltondp
18 | mastodon: miltondp
19 | mastodon-server: genomic.social
20 | email: milton.pividori@cuanschutz.edu
21 | affiliations:
22 | - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA
23 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
24 | funders:
25 | - The Gordon and Betty Moore Foundation (GBMF 4552)
26 | - The National Human Genome Research Institute (R01 HG010067)
27 | - The National Human Genome Research Institute (K99HG011898)
28 | - The Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765)
29 |
30 | - name: Sumei Lu
31 | affiliations:
32 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
33 |
34 | - name: Binglan Li
35 | orcid: 0000-0002-0103-6107
36 | affiliations:
37 | - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
38 |
39 | - name: Chun Su
40 | orcid: 0000-0001-6388-8666
41 | github: sckinta
42 | affiliations:
43 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
44 |
45 | - name: Matthew E. Johnson
46 | affiliations:
47 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
48 |
49 | - name: Wei-Qi Wei
50 | affiliations:
51 | - Vanderbilt University Medical Center, Nashville, TN 37232, USA
52 |
53 | - name: Qiping Feng
54 | orcid: 0000-0002-6213-793X
55 | affiliations:
56 | - Vanderbilt University Medical Center, Nashville, TN 37232, USA
57 |
58 | - name: Bahram Namjou
59 | affiliations:
60 | - Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
61 |
62 | - name: Krzysztof Kiryluk
63 | orcid: 0000-0002-5047-6715
64 | twitter: kirylukk
65 | affiliations:
66 | - Department of Medicine, Division of Nephrology, Vagelos College of Physicians \& Surgeons, Columbia University, New York, NY 10032, USA
67 |
68 | - name: Iftikhar Kullo
69 | affiliations:
70 | - Mayo Clinic, Rochester, MN 55905, USA
71 |
72 | - name: Yuan Luo
73 | orcid: 0000-0003-0195-7456
74 | affiliations:
75 | - Northwestern University, Chicago, IL 60611, USA
76 |
77 | - name: Blair D. Sullivan
78 | github: bdsullivan
79 | orcid: 0000-0001-7720-6208
80 | twitter: blairdsullivan
81 | affiliations:
82 | - Kahlert School of Computing, University of Utah, Salt Lake City, UT 84112, USA
83 |
84 | - name: Benjamin F. Voight
85 | orcid: 0000-0002-6205-9994
86 | twitter: bvoight28
87 | github: bvoight
88 | affiliations:
89 | - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
90 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
91 | - Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
92 |
93 | - name: Carsten Skarke
94 | orcid: 0000-0001-5145-3681
95 | twitter: CarstenSkarke
96 | affiliations:
97 | - Institute for Translational Medicine and Therapeutics, Department of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
98 |
99 | - name: Marylyn D. Ritchie
100 | initials: MDR
101 | orcid: 0000-0002-1208-1720
102 | twitter: MarylynRitchie
103 | email: marylyn@pennmedicine.upenn.edu
104 | affiliations:
105 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
106 |
107 | - name: Struan F.A. Grant
108 | email: grants@chop.edu
109 | orcid: 0000-0003-2025-5302
110 | twitter: STRUANGRANT
111 | affiliations:
112 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
113 | - Division of Endocrinology and Diabetes, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
114 | - Division of Human Genetics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
115 | - Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
116 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
117 |
118 | - name: Casey S. Greene
119 | github: cgreene
120 | initials: CSG
121 | orcid: 0000-0001-8713-9213
122 | twitter: GreeneScientist
123 | mastodon: greenescientist
124 | mastodon-server: genomic.social
125 | email: casey.s.greene@cuanschutz.edu
126 | affiliations:
127 | - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA
128 | - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA
129 | funders:
130 | - The Gordon and Betty Moore Foundation (GBMF 4552)
131 | - The National Human Genome Research Institute (R01 HG010067)
132 | - The National Cancer Institute (R01 CA237170)
133 | - The Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765)
134 | corresponding: true
135 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full_only_first_para/content/metadata.yaml:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Projecting genetic associations through gene expression patterns highlights disease etiology and drug mechanisms"
3 | date: 2023-09-09 # Defaults to date generated, but can specify like '2022-10-31'.
4 | keywords:
5 | - genetic studies
6 | - functional genomics
7 | - gene co-expression
8 | - therapeutic targets
9 | - drug repurposing
10 | - clustering of complex traits
11 | lang: en-US
12 | authors:
13 | - name: Milton Pividori
14 | github: miltondp
15 | initials: MP
16 | orcid: 0000-0002-3035-4403
17 | twitter: miltondp
18 | mastodon: miltondp
19 | mastodon-server: genomic.social
20 | email: milton.pividori@cuanschutz.edu
21 | affiliations:
22 | - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA
23 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
24 | funders:
25 | - The Gordon and Betty Moore Foundation (GBMF 4552)
26 | - The National Human Genome Research Institute (R01 HG010067)
27 | - The National Human Genome Research Institute (K99HG011898)
28 | - The Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765)
29 |
30 | - name: Sumei Lu
31 | affiliations:
32 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
33 |
34 | - name: Binglan Li
35 | orcid: 0000-0002-0103-6107
36 | affiliations:
37 | - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
38 |
39 | - name: Chun Su
40 | orcid: 0000-0001-6388-8666
41 | github: sckinta
42 | affiliations:
43 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
44 |
45 | - name: Matthew E. Johnson
46 | affiliations:
47 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
48 |
49 | - name: Wei-Qi Wei
50 | affiliations:
51 | - Vanderbilt University Medical Center, Nashville, TN 37232, USA
52 |
53 | - name: Qiping Feng
54 | orcid: 0000-0002-6213-793X
55 | affiliations:
56 | - Vanderbilt University Medical Center, Nashville, TN 37232, USA
57 |
58 | - name: Bahram Namjou
59 | affiliations:
60 | - Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
61 |
62 | - name: Krzysztof Kiryluk
63 | orcid: 0000-0002-5047-6715
64 | twitter: kirylukk
65 | affiliations:
66 | - Department of Medicine, Division of Nephrology, Vagelos College of Physicians \& Surgeons, Columbia University, New York, NY 10032, USA
67 |
68 | - name: Iftikhar Kullo
69 | affiliations:
70 | - Mayo Clinic, Rochester, MN 55905, USA
71 |
72 | - name: Yuan Luo
73 | orcid: 0000-0003-0195-7456
74 | affiliations:
75 | - Northwestern University, Chicago, IL 60611, USA
76 |
77 | - name: Blair D. Sullivan
78 | github: bdsullivan
79 | orcid: 0000-0001-7720-6208
80 | twitter: blairdsullivan
81 | affiliations:
82 | - Kahlert School of Computing, University of Utah, Salt Lake City, UT 84112, USA
83 |
84 | - name: Benjamin F. Voight
85 | orcid: 0000-0002-6205-9994
86 | twitter: bvoight28
87 | github: bvoight
88 | affiliations:
89 | - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
90 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
91 | - Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
92 |
93 | - name: Carsten Skarke
94 | orcid: 0000-0001-5145-3681
95 | twitter: CarstenSkarke
96 | affiliations:
97 | - Institute for Translational Medicine and Therapeutics, Department of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
98 |
99 | - name: Marylyn D. Ritchie
100 | initials: MDR
101 | orcid: 0000-0002-1208-1720
102 | twitter: MarylynRitchie
103 | email: marylyn@pennmedicine.upenn.edu
104 | affiliations:
105 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
106 |
107 | - name: Struan F.A. Grant
108 | email: grants@chop.edu
109 | orcid: 0000-0003-2025-5302
110 | twitter: STRUANGRANT
111 | affiliations:
112 | - Center for Spatial and Functional Genomics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
113 | - Division of Endocrinology and Diabetes, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
114 | - Division of Human Genetics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
115 | - Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
116 | - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
117 |
118 | - name: Casey S. Greene
119 | github: cgreene
120 | initials: CSG
121 | orcid: 0000-0001-8713-9213
122 | twitter: GreeneScientist
123 | mastodon: greenescientist
124 | mastodon-server: genomic.social
125 | email: casey.s.greene@cuanschutz.edu
126 | affiliations:
127 | - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA
128 | - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA
129 | funders:
130 | - The Gordon and Betty Moore Foundation (GBMF 4552)
131 | - The National Human Genome Research Institute (R01 HG010067)
132 | - The National Cancer Institute (R01 CA237170)
133 | - The Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765)
134 | corresponding: true
135 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 |
2 | # Contributor Covenant Code of Conduct
3 |
4 | ## Our Pledge
5 |
6 | We as members, contributors, and leaders pledge to make participation in our
7 | community a harassment-free experience for everyone, regardless of age, body
8 | size, visible or invisible disability, ethnicity, sex characteristics, gender
9 | identity and expression, level of experience, education, socio-economic status,
10 | nationality, personal appearance, race, caste, color, religion, or sexual
11 | identity and orientation.
12 |
13 | We pledge to act and interact in ways that contribute to an open, welcoming,
14 | diverse, inclusive, and healthy community.
15 |
16 | ## Our Standards
17 |
18 | Examples of behavior that contributes to a positive environment for our
19 | community include:
20 |
21 | * Demonstrating empathy and kindness toward other people
22 | * Being respectful of differing opinions, viewpoints, and experiences
23 | * Giving and gracefully accepting constructive feedback
24 | * Accepting responsibility and apologizing to those affected by our mistakes,
25 | and learning from the experience
26 | * Focusing on what is best not just for us as individuals, but for the overall
27 | community
28 |
29 | Examples of unacceptable behavior include:
30 |
31 | * The use of sexualized language or imagery, and sexual attention or advances of
32 | any kind
33 | * Trolling, insulting or derogatory comments, and personal or political attacks
34 | * Public or private harassment
35 | * Publishing others' private information, such as a physical or email address,
36 | without their explicit permission
37 | * Other conduct which could reasonably be considered inappropriate in a
38 | professional setting
39 |
40 | ## Enforcement Responsibilities
41 |
42 | Community leaders are responsible for clarifying and enforcing our standards of
43 | acceptable behavior and will take appropriate and fair corrective action in
44 | response to any behavior that they deem inappropriate, threatening, offensive,
45 | or harmful.
46 |
47 | Community leaders have the right and responsibility to remove, edit, or reject
48 | comments, commits, code, wiki edits, issues, and other contributions that are
49 | not aligned to this Code of Conduct, and will communicate reasons for moderation
50 | decisions when appropriate.
51 |
52 | ## Scope
53 |
54 | This Code of Conduct applies within all community spaces, and also applies when
55 | an individual is officially representing the community in public spaces.
56 | Examples of representing our community include using an official email address,
57 | posting via an official social media account, or acting as an appointed
58 | representative at an online or offline event.
59 |
60 | ## Enforcement
61 |
62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
63 | reported to the community leaders responsible for enforcement at
64 | [milton.pividori@cuanschutz.edu](mailto:milton.pividori@cuanschutz.edu).
65 | All complaints will be reviewed and investigated promptly and fairly.
66 |
67 | All community leaders are obligated to respect the privacy and security of the
68 | reporter of any incident.
69 |
70 | ## Enforcement Guidelines
71 |
72 | Community leaders will follow these Community Impact Guidelines in determining
73 | the consequences for any action they deem in violation of this Code of Conduct:
74 |
75 | ### 1. Correction
76 |
77 | **Community Impact**: Use of inappropriate language or other behavior deemed
78 | unprofessional or unwelcome in the community.
79 |
80 | **Consequence**: A private, written warning from community leaders, providing
81 | clarity around the nature of the violation and an explanation of why the
82 | behavior was inappropriate. A public apology may be requested.
83 |
84 | ### 2. Warning
85 |
86 | **Community Impact**: A violation through a single incident or series of
87 | actions.
88 |
89 | **Consequence**: A warning with consequences for continued behavior. No
90 | interaction with the people involved, including unsolicited interaction with
91 | those enforcing the Code of Conduct, for a specified period of time. This
92 | includes avoiding interactions in community spaces as well as external channels
93 | like social media. Violating these terms may lead to a temporary or permanent
94 | ban.
95 |
96 | ### 3. Temporary Ban
97 |
98 | **Community Impact**: A serious violation of community standards, including
99 | sustained inappropriate behavior.
100 |
101 | **Consequence**: A temporary ban from any sort of interaction or public
102 | communication with the community for a specified period of time. No public or
103 | private interaction with the people involved, including unsolicited interaction
104 | with those enforcing the Code of Conduct, is allowed during this period.
105 | Violating these terms may lead to a permanent ban.
106 |
107 | ### 4. Permanent Ban
108 |
109 | **Community Impact**: Demonstrating a pattern of violation of community
110 | standards, including sustained inappropriate behavior, harassment of an
111 | individual, or aggression toward or disparagement of classes of individuals.
112 |
113 | **Consequence**: A permanent ban from any sort of public interaction within the
114 | community.
115 |
116 | ## Attribution
117 |
118 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
119 | version 2.1, available at
120 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
121 |
122 | Community Impact Guidelines were inspired by
123 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC].
124 |
125 | For answers to common questions about this code of conduct, see the FAQ at
126 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
127 | [https://www.contributor-covenant.org/translations][translations].
128 |
129 | [homepage]: https://www.contributor-covenant.org
130 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
131 | [Mozilla CoC]: https://github.com/mozilla/diversity
132 | [FAQ]: https://www.contributor-covenant.org/faq
133 | [translations]: https://www.contributor-covenant.org/translations
134 |
--------------------------------------------------------------------------------
/tests/manuscripts/mutator-epistasis/02.introduction.md:
--------------------------------------------------------------------------------
1 | ## Introduction
2 |
3 | Germline mutation rates reflect the complex interplay between DNA proofreading and repair pathways, exogenous sources of DNA damage, and life-history traits.
4 | For example, parental age is an important determinant of mutation rate variability; in many mammalian species, the number of germline *de novo* mutations observed in offspring increases as a function of paternal and maternal age [@PMID:28959963;@PMID:31549960;@PMID:35771663;@PMID:32804933;@PMID:31492841].
5 | Rates of germline mutation accumulation are also variable across human families [@PMID:26656846;@PMID:31549960], likely due to either genetic variation or differences in environmental exposures.
6 | Although numerous protein-coding genes contribute to the maintenance of genome integrity, genetic variants that increase germline mutation rates, known as *mutator alleles*, have proven difficult to discover in mammals.
7 |
8 | The dearth of observed germline mutators in mammalian genomes is not necessarily surprising, since alleles that lead to elevated germline mutation rates would likely have deleterious consequences and be purged by negative selection if their effect sizes are large [@PMID:27739533].
9 | Moreover, germline mutation rates are relatively low, and direct mutation rate measurements require whole-genome sequencing data from both parents and their offspring.
10 | As a result, large-scale association studies — which have been used to map the contributions of common genetic variants to many complex traits — are not currently well-powered to investigate the polygenic architecture of germline mutation rates [@PMID:31964835].
11 |
12 | Despite these challenges, less traditional strategies have been used to identify a small number of mutator alleles in humans, macaques [@doi:10.1101/2023.03.27.534460], and mice.
13 | By focusing on families with rare genetic diseases, a recent study discovered two mutator alleles that led to significantly elevated rates of *de novo* germline mutation in human genomes [@PMID:35545669].
14 | Other groups have observed mutator phenotypes in the germlines and somatic tissues of adults who carry cancer-predisposing inherited mutations in the POLE/POLD1 exonucleases [@PMID:34594041;@PMID:37336879].
15 | Candidate mutator loci were also found by identifying human haplotypes from the Thousand Genomes Project with excess counts of derived alleles in genomic windows [@PMID:28095480].
16 |
17 | In mice, a germline mutator allele was recently discovered by sequencing a large family of inbred mice [@PMID:35545679].
18 | Commonly known as the BXDs, these recombinant inbred lines (RILs) were derived from either F2 or advanced intercrosses of C57BL/6J and DBA/2J, two laboratory strains that exhibit significant differences in their germline mutation spectra [@PMID:33472028;@PMID:30753674].
19 | The BXDs were maintained via brother-sister mating for up to 180 generations, and each BXD therefore accumulated hundreds or thousands of germline mutations on a nearly-homozygous linear mosaic of parental B and D haplotypes.
20 | Due to their husbandry in a controlled laboratory setting, the BXDs were largely free from confounding by environmental heterogeneity, and the effects of selection on *de novo* mutations were attenuated by strict inbreeding [@doi:10.1146/annurev.ecolsys.39.110707.173437].
21 |
22 | In this previous study, whole-genome sequencing data from the BXD family were used to map a quantitative trait locus (QTL) for the C>A mutation rate [@PMID:35545679].
23 | Germline C>A mutation rates were nearly 50% higher in mice with *D* haplotypes at the QTL, likely due to genetic variation in the DNA glycosylase *Mutyh* that reduced the efficacy of oxidative DNA damage repair.
24 | Pathogenic variants of *Mutyh* also appear to act as mutators in normal human germline and somatic tissues [@PMID:35803914;@PMID:30753674].
25 | Importantly, the QTL did not reach genome-wide significance in a scan for variation in overall germline mutation rates, which were only modestly higher in BXDs with *D* alleles, demonstrating the utility of mutation spectrum analysis for mutator allele discovery.
26 | Close examination of the mutation spectrum is likely to be broadly useful for detecting mutator alleles, as genes involved in DNA proofreading and repair often recognize particular sequence motifs or excise specific types of DNA lesions [@PMID:32619789].
27 | Mutation spectra are usually defined in terms of $k$-mer nucleotide context; the 1-mer mutation spectrum, for example, consists of 6 mutation types after collapsing by strand complement (C>T, C>A, C>G, A>T, A>C, A>G), while the 3-mer mutation spectrum contains 96 (each of the 1-mer mutations partitioned by trinucleotide context).
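
The counts quoted above (6 types for the 1-mer spectrum, 96 for the 3-mer spectrum) can be checked by enumeration. The following is a minimal sketch, not code from the study: the function names `collapsed_substitutions` and `kmer_spectrum` are hypothetical, and it assumes that collapsing by strand complement means writing every substitution with a C or A reference base.

```python
from itertools import product

BASES = "ACGT"


def collapsed_substitutions():
    """Return the single-base substitutions left after collapsing by strand complement.

    Assumption for this sketch: every substitution is written with a C or A
    reference base, since G>N and T>N are the reverse complements of C>N and A>N.
    """
    return [
        f"{ref}>{alt}"
        for ref, alt in product(BASES, repeat=2)
        if ref != alt and ref in "CA"
    ]


def kmer_spectrum(k=3):
    """Enumerate mutation types for an odd k, with the mutated base at the center."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    flank = (k - 1) // 2
    types = []
    for substitution in collapsed_substitutions():
        ref, alt = substitution.split(">")
        # Partition each substitution by every possible flanking context.
        for left in product(BASES, repeat=flank):
            for right in product(BASES, repeat=flank):
                l, r = "".join(left), "".join(right)
                types.append(f"{l}{ref}{r}>{l}{alt}{r}")
    return types


one_mer = kmer_spectrum(1)
three_mer = kmer_spectrum(3)
print(len(one_mer), len(three_mer))
```

Each additional pair of flanking bases multiplies the number of mutation types by 16, which illustrates why testing every $k$-mer type separately quickly increases the number of tests.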
28 |
29 | Although mutation spectrum analysis can enable the discovery of mutator alleles that affect the rates of specific mutation types, early implementations of this strategy have suffered from a few drawbacks.
30 | For example, performing association tests on the rates or fractions of every $k$-mer mutation type can quickly incur a substantial multiple testing burden.
31 | Since germline mutation rates are generally quite low, estimates of $k$-mer mutation type frequencies from individual samples can also be noisy and imprecise.
32 | Moreover, inbreeding duration can vary considerably across samples in populations of RILs; for example, some BXDs were inbred for only 20 generations, while others were inbred for nearly 200.
33 | As a result, the variance of individual $k$-mer mutation rate estimates in those populations will be much higher than if all samples were inbred for the same duration.
34 | We were therefore motivated to develop a statistical method that could overcome the sparsity of *de novo* mutation spectra, eliminate the need to test each $k$-mer mutation type separately, and enable sensitive detection of alleles that influence the germline mutation spectrum.
35 |
36 | Here, we present a new mutation spectrum association test, called "aggregate mutation spectrum distance," that minimizes multiple testing burdens and mitigates the challenges of sparsity in *de novo* mutation datasets.
37 | We leverage this method to re-analyze germline mutation data from the BXD family and find compelling evidence for a second mutator allele that was not detected using previous approaches.
38 | The new allele appears to interact epistatically with the mutator that was previously discovered in the BXDs, further augmenting the C>A germline mutation rate in a subset of inbred mice.
39 | Our observation of epistasis suggests that mild DNA repair deficiencies can compound one another, as mutator alleles chip away at the redundant systems that collectively maintain germline integrity.
40 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/04.10.results_comp.md:
--------------------------------------------------------------------------------
1 | ### The CCC reveals linear and nonlinear patterns in human transcriptomic data
2 |
3 | We next examined the characteristics of these correlation coefficients in gene expression data from GTEx v8 across different tissues.
4 | We selected the top 5,000 genes with the largest variance for our initial analyses on whole blood and then computed the correlation matrix between genes using Pearson, Spearman and CCC (see [Methods](#sec:data_gtex)).
5 |
6 |
7 | We examined the distribution of each coefficient's absolute values in GTEx (Figure @fig:dist_coefs).
8 | CCC (mean=0.14, median=0.08, sd=0.15) has a much more skewed distribution than Pearson (mean=0.31, median=0.24, sd=0.24) and Spearman (mean=0.39, median=0.37, sd=0.26).
9 | The coefficients reach a cumulative set containing 70% of gene pairs at different values (Figure @fig:dist_coefs b), $c=0.18$, $p=0.44$ and $s=0.56$, suggesting that for this type of data, the coefficients are not directly comparable by magnitude, so we used ranks for further comparisons.
10 | In GTEx v8, CCC values were closer to Spearman and vice versa than either was to Pearson (Figure @fig:dist_coefs c).
11 | We also compared the Maximal Information Coefficient (MIC) in this data (see [Supplementary Note 1](#sec:mic)).
12 | We found that CCC behaved very similarly to MIC, although CCC was up to two orders of magnitude faster to run (see [Supplementary Note 2](#sec:time_test)).
13 | MIC, an advanced correlation coefficient able to capture general patterns beyond linear relationships, represented a significant step forward in correlation analysis research and has been successfully used in various application domains [@pmid:33972855; @pmid:33001806; @pmid:27006077].
14 | These results suggest that our findings for CCC generalize to MIC; therefore, in the subsequent analyses, we focus on CCC and linear-only coefficients.
15 |
16 |
17 | {#fig:dist_coefs width="100%"}
23 |
24 |
25 | A closer inspection of gene pairs that were either prioritized or disregarded by these coefficients revealed that they captured different patterns.
26 | We analyzed the agreements and disagreements by obtaining, for each coefficient, the top 30% of gene pairs with the largest correlation values ("high" set) and the bottom 30% ("low" set), resulting in six potentially overlapping categories.
27 | For most cases (76.4%), an UpSet analysis [@doi:10.1109/TVCG.2014.2346248] (Figure @fig:upsetplot_coefs a) showed that the three coefficients agreed on whether there is a strong correlation (42.1%) or there is no relationship (34.3%).
28 | Since Pearson and Spearman are linear-only, and CCC can also capture these patterns, we expect that these concordant gene pairs represent clear linear patterns.
29 | CCC and Spearman agree more on either highly or poorly correlated pairs (4.0% in "high", and 7.0% in "low") than any of these with Pearson (all between 0.3%-3.5% for "high", and 2.8%-5.5% for "low").
30 | In summary, CCC agrees with either Pearson or Spearman in 90.5% of gene pairs by assigning a high or a low correlation value.
31 |
32 | {#fig:upsetplot_coefs width="100%"}
41 |
42 |
43 | While there was broad agreement, more than 20,000 gene pairs with a high CCC value were not highly ranked by the other coefficients (right part of Figure @fig:upsetplot_coefs a).
44 | There were also gene pairs with a high Pearson value and either low CCC (1,075), low Spearman (87) or both low CCC and low Spearman values (531).
45 | However, our examination suggests that many of these cases appear to be driven by potential outliers (Figure @fig:upsetplot_coefs b, and analyzed later).
46 | We analyzed gene pairs among the top five of each intersection in the "Disagreements" group (Figure @fig:upsetplot_coefs a, right) where CCC disagrees with Pearson, Spearman or both.
47 |
48 | {#fig:gtex_tissues:kdm6a_uty width="95%"}
52 |
53 | The first three gene pairs at the top (*IFNG* - *SDS*, *JUN* - *APOC1*, and *ZDHHC12* - *CCL18*), with high CCC and low Pearson values, appear to follow a non-coexistence relationship: in samples where one of the genes is highly (slightly) expressed, the other is slightly (highly) activated, suggesting a potentially inhibiting effect.
54 | The following three gene pairs (*UTY* - *KDM6A*, *RASSF2* - *CYTIP*, and *AC068580.6* - *KLHL21*) follow patterns combining either two linear or one linear and one independent relationships.
55 | In particular, genes *UTY* and *KDM6A* (paralogs) show a nonlinear relationship where a subset of samples follows a robust linear pattern and another subset has a constant (independent) expression of one gene.
56 | This relationship is explained by the fact that *UTY* is on chromosome Y (Yq11) whereas *KDM6A* is on chromosome X (Xp11), and samples with a linear pattern are males, whereas those with no expression for *UTY* are females.
57 | This combination of linear and independent patterns is captured by CCC ($c=0.29$, above the 80th percentile) but not by Pearson ($p=0.24$, below the 55th percentile) or Spearman ($s=0.10$, below the 15th percentile).
58 | Furthermore, the same gene pair pattern is highly ranked by CCC in all other tissues in GTEx, except for female-specific organs (Figure @fig:gtex_tissues:kdm6a_uty).
59 |
--------------------------------------------------------------------------------
/docs/custom-prompts.md:
--------------------------------------------------------------------------------
1 | # Custom Prompts
2 |
3 | Rather than using the default prompt, you can specify custom prompts for each file in your manuscript.
4 | This can be useful when you want specific sections of your manuscript to be revised in specific ways, or not revised at all.
5 |
6 | There are two ways that you can use the custom prompts system:
7 | 1. You can define your prompts and how they map to your manuscript files in a single file, `ai-revision-prompts.yaml`.
8 | 2. You can create the `ai-revision-prompts.yaml`, but only specify prompts and identifiers, which makes it suitable for sharing with others who have different names for their manuscripts' files.
9 | You would then specify a second file, `ai-revision-config.yaml`, that maps the prompt identifiers to the actual files in your manuscript.
10 |
11 | These files should be placed in the `ci` directory under your manubot root directory.
12 |
13 | See [Functionality Notes](#functionality-notes) later in this document for more information on how to write regular expressions and use placeholders in your prompts.
14 |
15 |
16 | ## Approach 1: Single file
17 |
18 | With this approach, you can define your prompts and how they map to your manuscript files in a single file.
19 | The single file should be named `ai-revision-prompts.yaml` and placed in the `ci` folder.
20 |
21 | The file would look something like the following:
22 |
23 | ```yaml
24 | prompts_files:
25 |   # filenames are specified as regular expressions
26 |   # in this case, we match a file named exactly 'filename.md'
27 |   ^filename\.md$: "Prompt text here"
28 |
29 |   # you can use YAML's multi-line string syntax to write longer prompts
30 |   # you can also use {placeholders} to include metadata from your manuscript
31 |   ^filename\.md$: |
32 |     Revise the following paragraph from a manuscript titled {title}
33 |     so that it sounds like an academic paper.
34 |
35 |   # specifying the special value 'null' will skip revising any files that
36 |   # match this regular expression
37 |   ^ignore_this_file\.md$: null
38 | ```
39 |
40 | Note that, for each file, the first matching regular expression will determine its prompt or whether the file is skipped.
41 | Even if a file matches multiple regexes, only the first one will be used.
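
The first-match rule can be illustrated with a short Python sketch. This is not the tool's actual implementation: the `resolve_prompt` helper, the `PROMPTS_FILES` dictionary, and the use of `re.search` are assumptions made for illustration only.

```python
import re

# Hypothetical stand-in for the parsed `prompts_files` mapping. Python dicts
# preserve insertion order, so patterns are tried from top to bottom.
PROMPTS_FILES = {
    r"^ignore_this_file\.md$": None,  # `null` in YAML: skip this file
    r"^02\..*\.md$": "Prompt for section 02 files",
    r"\.md$": "Fallback prompt for any other Markdown file",
}


def resolve_prompt(filename, prompts_files):
    """Return (matched, prompt) for the first pattern that matches filename.

    A prompt of None with matched=True means "skip this file";
    matched=False means that no pattern applied at all.
    """
    for pattern, prompt in prompts_files.items():
        if re.search(pattern, filename):
            return True, prompt
    return False, None
```

Here `02.methods.md` matches both the second and third patterns, but only the second one (the first match in file order) determines its prompt, so more specific patterns should be listed before more general ones.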
42 |
43 |
44 | ## Approach 2: Prompt file plus configuration file
45 |
46 | In this case, we specify two files, `ai-revision-prompts.yaml` and `ai-revision-config.yaml`.
47 |
48 | The `ai-revision-prompts.yaml` file contains only the prompts and their identifiers.
49 | The top-level element is `prompts` in this case rather than `prompts_files`, as it defines a set of reusable prompts and not prompt-file mappings.
50 |
51 | Here's an example of what the `ai-revision-prompts.yaml` file might look like:
52 | ```yaml
53 | prompts:
54 |   intro_prompt: "Prompt text here"
55 |   content_prompts: |
56 |     Revise the following paragraph from a manuscript titled {title}
57 |     so that it sounds like an academic paper.
58 |
59 |   my_default: "Revise this paragraph so it sounds nicer."
60 | ```
61 |
62 | The `ai-revision-config.yaml` file maps the prompt identifiers to the actual files in your manuscript.
63 |
64 | An example of the `ai-revision-config.yaml` file:
65 | ```yaml
66 | files:
67 |   matchings:
68 |     - files:
69 |         - ^introduction\.md$
70 |       prompt: intro_prompt
71 |     - files:
72 |         - ^methods\.md$
73 |         - ^abstract\.md$
74 |       prompt: content_prompts
75 |
76 |   # the special value default_prompt is used when no other regex matches
77 |   # it also uses a prompt identifier taken from ai-revision-prompts.yaml
78 |   default_prompt: my_default
79 |
80 |   # any file you want to be skipped can be specified in this list
81 |   ignores:
82 |     - ^ignore_this_file\.md$
83 | ```
84 |
85 | Multiple regexes can be specified in a list under `files` to match multiple files to a single prompt.
86 |
87 | In this case, the `default_prompt` is used when no other regex matches, and it uses a prompt identifier taken from `ai-revision-prompts.yaml`.
88 |
89 | The `ignores` list specifies files that should be skipped entirely during the revision process; they won't have the default prompt applied to them.
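
As a rough illustration of how the two files fit together, the following Python sketch resolves a filename to prompt text. It is a minimal sketch, not the tool's own code: the `resolve` function, the in-memory `PROMPTS` and `FILES_CONFIG` dictionaries (standing in for the parsed YAML files), the use of `re.search`, and the choice to check `ignores` before `matchings` are all assumptions.

```python
import re

# Hypothetical parsed contents of ai-revision-prompts.yaml (the `prompts` key).
PROMPTS = {
    "intro_prompt": "Prompt text here",
    "content_prompts": "Revise the following paragraph from a manuscript titled {title}",
    "my_default": "Revise this paragraph so it sounds nicer.",
}

# Hypothetical parsed contents of ai-revision-config.yaml (the `files` key).
FILES_CONFIG = {
    "matchings": [
        {"files": [r"^introduction\.md$"], "prompt": "intro_prompt"},
        {"files": [r"^methods\.md$", r"^abstract\.md$"], "prompt": "content_prompts"},
    ],
    "default_prompt": "my_default",
    "ignores": [r"^ignore_this_file\.md$"],
}


def resolve(filename, prompts, files_config):
    """Return the prompt text for filename, or None if it should be skipped."""
    # Assumption: ignored files are checked first and never get a prompt.
    if any(re.search(p, filename) for p in files_config.get("ignores", [])):
        return None
    # Each matching maps a list of regexes to a single prompt identifier.
    for matching in files_config.get("matchings", []):
        if any(re.search(p, filename) for p in matching["files"]):
            return prompts[matching["prompt"]]
    # Fall back to the default prompt identifier, if one is configured.
    default_id = files_config.get("default_prompt")
    return prompts[default_id] if default_id else None
```

With these example values, `methods.md` and `abstract.md` share the `content_prompts` text, an unlisted file such as `results.md` gets the `my_default` text, and `ignore_this_file.md` is skipped.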
90 |
91 |
92 | ## Functionality Notes
93 |
94 | ### Filenames as Regular Expressions
95 |
96 | Filenames in either approach are specified as regular expressions (aka "regexes").
97 | This allows you to flexibly match multiple files to a prompt with a single expression.
98 |
99 | A simple example: to specify an exact match for, say, `myfile.md`, you'd supply the regular expression `^myfile\.md$`, where:
100 | - `^` matches the beginning of the filename
101 | - `\.` matches a literal period -- otherwise, `.` means "any character"
102 | - `$` matches the end of the filename
103 |
104 | To illustrate why that syntax is important: if you were to write it as `myfile.md`, the `.` would match any character, so it would match `myfileAmd`, `myfile2md`, etc.
105 | Without the `^` and `$`, it would also match filenames like `asdf_myfile.md`, `myfile.md_asdf`, and `asdf_myfile.md.txt`.
106 |
107 | The benefit of using regexes becomes more apparent when you have multiple files.
108 | For example, say you had three files, `02.prior-work.md`, `02.methods.md`, and `02.results.md`. To match all of these, you could use the expression `^02\..*\.md$`.
109 | This would match any file beginning with `02.` and ending with `.md`.
110 | Here, `.` again indicates "any character" and the `*` means "zero or more of the preceding character"; together, they match any sequence of characters.
111 |
112 | You can find more information on how to write regular expressions in [Python's `re` module documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax).
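
The examples above can be checked directly with Python's `re` module. The sketch below uses `re.search`, which looks for the pattern anywhere in the string; that choice is an assumption consistent with the unanchored behavior described above, and the `matches` helper is hypothetical.

```python
import re


def matches(pattern, filename):
    """Return True if the regular expression is found anywhere in filename."""
    return re.search(pattern, filename) is not None


# Anchored and escaped: only the exact filename matches.
exact = r"^myfile\.md$"
# Unescaped '.', no anchors: far more permissive than intended.
loose = r"myfile.md"
# Any file beginning with '02.' and ending with '.md'.
section_02 = r"^02\..*\.md$"

for name in ["myfile.md", "myfileAmd", "asdf_myfile.md.txt", "02.methods.md"]:
    print(name, matches(exact, name), matches(loose, name), matches(section_02, name))
```

Running a quick check like this against your own filenames before committing the prompt files is a cheap way to catch a missing anchor or an unescaped period.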
113 |
114 |
115 | ### Placeholders
116 |
117 | The prompt text can include metadata from your manuscript, specified in `content/metadata.yaml` in Manubot. Writing
118 | `{placeholder}` into your prompt text will cause it to be replaced with the corresponding value, drawn either
119 | from the manuscript metadata or from the current file/paragraph being revised.
120 |
121 | The following placeholders are available:
122 | - `{title}`: the title of the manuscript, as defined in the metadata
123 | - `{keywords}`: comma-delimited keywords from the manuscript metadata
124 | - `{paragraph_text}`: the text from the current paragraph
125 | - `{section_name}`: the name of the section (which is one of the following values "abstract", "introduction", "results", "discussion", "conclusions", "methods" or "supplementary material"), derived from the filename.
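
To make the substitution concrete, here is a minimal Python sketch that fills these placeholders using `str.format`. The tool's actual substitution mechanism is not described here, so the `render_prompt` helper, the shape of the `metadata` dictionary, and the use of `str.format` are assumptions for illustration.

```python
def render_prompt(template, metadata, paragraph_text, section_name):
    """Fill the documented placeholders in a prompt template.

    Placeholders that the template does not use are simply ignored.
    Note that str.format would also treat any other {braces} in the
    template as placeholders, so this sketch assumes there are none.
    """
    return template.format(
        title=metadata["title"],
        keywords=", ".join(metadata["keywords"]),
        paragraph_text=paragraph_text,
        section_name=section_name,
    )


# Hypothetical metadata, standing in for values read from metadata.yaml.
example_metadata = {
    "title": "An example manuscript",
    "keywords": ["correlation", "gene expression"],
}

example_prompt = render_prompt(
    "Revise this paragraph from the {section_name} of '{title}' "
    "(keywords: {keywords}): {paragraph_text}",
    example_metadata,
    paragraph_text="Genes work together.",
    section_name="introduction",
)
```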
126 |
127 | The `section_name` placeholder works like so:
128 | - If the environment variable `AI_EDITOR_FILENAME_SECTION_MAPPING` is set, it is interpreted as a dictionary mapping filenames to section names.
129 | If a key of the dictionary is contained in the filename, the corresponding value is used as the section name.
130 | The keys and values can be any strings, not just the section names mentioned before.
131 | - If the dictionary mentioned above is unset, or the filename doesn't contain any of its keys, the filename is matched against the following values: "introduction", "methods", "results", "discussion", "conclusions" or "supplementary".
132 | If one of these values is contained within the filename, the section name is mapped to that value. "supplementary" is replaced with "supplementary material", but the others are used as is.
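
The two rules above can be sketched in Python as follows. This is only an illustration of the documented behavior, not the tool's code: the `section_name_for` helper is hypothetical, and parsing the environment variable as JSON is an assumption, since the exact format of the dictionary is not specified here.

```python
import json
import os

# Fallback values, taken from the list documented above.
DEFAULT_SECTIONS = (
    "introduction",
    "methods",
    "results",
    "discussion",
    "conclusions",
    "supplementary",
)


def section_name_for(filename, env=None):
    """Derive a section name from a filename, or return None if nothing fits."""
    env = os.environ if env is None else env
    raw = env.get("AI_EDITOR_FILENAME_SECTION_MAPPING")
    # Assumption: the variable holds a JSON object such as {"meths": "methods"}.
    mapping = json.loads(raw) if raw else {}

    # Rule 1: a custom key contained in the filename wins.
    for key, value in mapping.items():
        if key in filename:
            return value

    # Rule 2: fall back to the built-in section names.
    for name in DEFAULT_SECTIONS:
        if name in filename:
            return "supplementary material" if name == "supplementary" else name
    return None
```

For example, a file named `08.01.meths.md` has no section name under the fallback rule, but a mapping such as `{"meths": "methods"}` would assign it to "methods".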
133 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/04.12.results_giant.md:
--------------------------------------------------------------------------------
1 | ### Replication of gene associations using tissue-specific gene networks from GIANT
2 |
3 | We sought to systematically analyze discrepant scores to assess whether associations were replicated in other datasets besides GTEx.
4 | This is challenging and prone to bias because linear-only correlation coefficients are usually used in gene co-expression analyses.
5 | We used 144 tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@pmcid:PMC4828725; @url:https://hb.flatironinstitute.org], where nodes represent genes and each edge a functional relationship weighted with a probability of interaction between two genes (see [Methods](#sec:giant)).
6 | Importantly, the version of GIANT used in this study did not include GTEx samples [@url:https://hb.flatironinstitute.org/data], making it an ideal case for replication.
7 | These networks were built from expression and different interaction measurements, including protein-interaction, transcription factor regulation, chemical/genetic perturbations and microRNA target profiles from the Molecular Signatures Database (MSigDB [@pmid:16199517]).
8 | We reasoned that highly-ranked gene pairs using three different coefficients in a single tissue (whole blood in GTEx, Figure @fig:upsetplot_coefs) that represented real patterns should often replicate in a corresponding tissue or related cell lineage using the multi-cell type functional interaction networks in GIANT.
9 | In addition to predicting a network with interactions for a pair of genes, the GIANT web application can also automatically detect a relevant tissue or cell type where genes are predicted to be specifically expressed (the approach uses a machine learning method introduced in [@doi:10.1101/gr.155697.113] and described in [Methods](#sec:giant)).
10 | For example, we obtained the networks in blood and the automatically-predicted cell type for gene pairs *RASSF2* - *CYTIP* (CCC high, Figure @fig:giant_gene_pairs a) and *MYOZ1* - *TNNI2* (Pearson high, Figure @fig:giant_gene_pairs b).
11 | In addition to the gene pair, the networks include other genes connected according to their probability of interaction (up to 15 additional genes are shown), which allows estimating whether genes are part of the same tissue-specific biological process.
12 | Two large black nodes in each network's top-left and bottom-right corners represent our gene pairs.
13 | A green edge means a close-to-zero probability of interaction, whereas a red edge represents a strong predicted relationship between the two genes.
14 | In this example, genes *RASSF2* and *CYTIP* (Figure @fig:giant_gene_pairs a), with a high CCC value ($c=0.20$, above the 73rd percentile) and low Pearson and Spearman ($p=0.16$ and $s=0.11$, below the 38th and 17th percentiles, respectively), were both strongly connected to the blood network, with interaction scores of at least 0.63 and an average of 0.75 and 0.84, respectively (Supplementary Table @tbl:giant:weights).
15 | The autodetected cell type for this pair was leukocytes, and interaction scores were similar to the blood network (Supplementary Table @tbl:giant:weights).
16 | However, genes *MYOZ1* and *TNNI2*, with a very high Pearson value ($p=0.97$), moderate Spearman ($s=0.28$) and very low CCC ($c=0.03$), were predicted to belong to much less cohesive networks (Figure @fig:giant_gene_pairs b), with average interaction scores of 0.17 and 0.22 with the rest of the genes, respectively.
17 | Additionally, the autodetected cell type (skeletal muscle) is not related to blood or one of its cell lineages.
18 | These preliminary results suggested that CCC might be capturing blood-specific patterns missed by the other coefficients.
19 |
20 | ![
21 | **Analysis of GIANT tissue-specific predicted networks for gene pairs prioritized by correlation coefficients.**
22 | **a-b)** Two gene pairs prioritized by correlation coefficients (from Figure @fig:upsetplot_coefs b) with their predicted networks in blood (left) and an automatically selected tissue/cell type (right) using the method described in [@doi:10.1101/gr.155697.113].
23 | A node represents a gene and an edge the probability that two genes are part of the same biological process in a specific cell type.
24 | A maximum of 15 genes are shown for each network.
25 | The GIANT web application automatically determined a minimum interaction confidence (edges' weights) to be shown.
26 | These networks can be analyzed online using the following links:
27 | *RASSF2* - *CYTIP* [@url:https://hb.flatironinstitute.org/gene/9770+9595],
28 | *MYOZ1* - *TNNI2* [@url:https://hb.flatironinstitute.org/gene/58529+7136].
29 | **c)** Summary of predicted tissue/cell type networks for gene pairs exclusively prioritized by CCC and Pearson.
30 | The first row combines all gene pairs where CCC is high and Pearson or Spearman are low.
31 | The second row combines all gene pairs where Pearson is high and CCC or Spearman are low.
32 | Bar plots (left) show the number of gene pairs for each predicted tissue/cell type.
33 | Box plots (right) show the average probability of interaction between genes in these predicted tissue-specific networks.
34 | Red indicates CCC-only tissues/cell types, blue are Pearson-only, and purple are shared.
35 | ](images/coefs_comp/giant_networks/top_gene_pairs-main.svg "GIANT network interaction on gene pairs"){#fig:giant_gene_pairs width="100%"}
36 |
37 |
38 | We next performed a systematic evaluation using the top 100 discrepant gene pairs between CCC and the other two coefficients.
39 | For each gene pair prioritized in GTEx (whole blood), we autodetected a relevant cell type using GIANT to assess whether genes were predicted to be specifically expressed in a blood-relevant cell lineage.
40 | For this, we used the top five most commonly autodetected cell types for each coefficient and assessed connectivity in the resulting networks (see [Methods](#sec:giant)).
41 | The top 5 predicted cell types for gene pairs highly ranked by CCC and not by the rest were all blood-specific (Figure @fig:giant_gene_pairs c, top left), including macrophage, leukocyte, natural killer cell, blood and mononuclear phagocyte.
42 | The average probability of interaction between genes in these CCC-ranked networks was significantly higher than for the other coefficients (Figure @fig:giant_gene_pairs c, top right), with all medians larger than 67% and first quartiles above 41% across predicted cell types.
43 | In contrast, most Pearson's gene pairs were predicted to be specific to tissues unrelated to blood (Figure @fig:giant_gene_pairs c, bottom left), with skeletal muscle being the most commonly predicted tissue.
44 | The interaction probabilities in these Pearson-ranked networks were also generally lower than in CCC, except for blood-specific gene pairs (Figure @fig:giant_gene_pairs c, bottom right).
45 | The associations exclusively detected by CCC in whole blood from GTEx were more strongly replicated in these independent networks that incorporated multiple data modalities.
46 | CCC-ranked gene pairs not only had high probabilities of belonging to the same biological process but were also predicted to be specifically expressed in blood cell lineages.
47 | Conversely, most Pearson-ranked gene pairs were not predicted to be blood-specific, and their interaction probabilities were relatively low.
48 | This lack of replication in GIANT suggests that top Pearson-ranked gene pairs in GTEx might be driven mainly by outliers, which is consistent with our earlier observations of outlier-driven associations (Figure @fig:upsetplot_coefs b).
49 |
--------------------------------------------------------------------------------
/tests/manuscripts/phenoplier_full/content/02.introduction.md:
--------------------------------------------------------------------------------
1 | ## Introduction
2 |
3 | Genes work together in context-specific networks to carry out different functions [@pmid:19104045; @doi:10.1038/ng.3259].
4 | Variations in these genes can change their functional role and, at a higher level, affect disease-relevant biological processes [@doi:10.1038/s41467-018-06022-6].
5 | In this context, determining how genes influence complex traits requires mechanistically understanding expression regulation across different cell types [@doi:10.1126/science.aaz1776; @doi:10.1038/s41586-020-2559-3; @doi:10.1038/s41576-019-0200-9], which in turn should lead to improved treatments [@doi:10.1038/ng.3314; @doi:10.1371/journal.pgen.1008489].
6 | Previous studies have described different regulatory DNA elements [@doi:10.1038/nature11247; @doi:10.1038/nature14248; @doi:10.1038/nature12787; @doi:10.1038/s41586-020-03145-z; @doi:10.1038/s41586-020-2559-3] including genetic effects on gene expression across different tissues [@doi:10.1126/science.aaz1776].
7 | Integrating functional genomics data and GWAS data [@doi:10.1038/s41588-018-0081-4; @doi:10.1016/j.ajhg.2018.04.002; @doi:10.1038/ncomms6890] has improved the identification of these transcriptional mechanisms that, when dysregulated, commonly result in tissue- and cell lineage-specific pathology [@pmid:20624743; @pmid:14707169; @doi:10.1073/pnas.0810772105].
8 |
9 |
10 | Given the availability of gene expression data across several tissues [@doi:10.1038/nbt.3838; @doi:10.1038/s41467-018-03751-6; @doi:10.1126/science.aaz1776; @doi:10.1186/s13040-020-00216-9], an effective approach to identify these biological processes is the transcriptome-wide association study (TWAS), which integrates expression quantitative trait loci (eQTLs) data to provide a mechanistic interpretation for GWAS findings.
11 | TWAS relies on testing whether perturbations in gene regulatory mechanisms mediate the association between genetic variants and human diseases [@doi:10.1371/journal.pgen.1009482; @doi:10.1038/ng.3506; @doi:10.1371/journal.pgen.1007889; @doi:10.1038/ng.3367], and these approaches have been highly successful not only in understanding disease etiology at the transcriptome level [@pmid:33931583; @doi:10.1101/2021.10.21.21265225; @pmid:31036433] but also in disease-risk prediction (polygenic scores) [@doi:10.1186/s13059-021-02591-w] and drug repurposing [@doi:10.1038/nn.4618] tasks.
12 | However, TWAS works at the individual gene level, which does not capture more complex interactions at the network level.
13 |
14 |
15 | These gene-gene interactions play a crucial role in current theories of the architecture of complex traits, such as the omnigenic model [@doi:10.1016/j.cell.2017.05.038], which suggests that methods need to incorporate this complexity to disentangle disease-relevant mechanisms.
16 | Widespread gene pleiotropy, for instance, reveals the highly interconnected nature of transcriptional networks [@doi:10.1038/s41588-019-0481-0; @doi:10.1038/ng.3570], where potentially all genes expressed in disease-relevant cell types have a non-zero effect on the trait [@doi:10.1016/j.cell.2017.05.038; @doi:10.1016/j.cell.2019.04.014].
17 | One way to learn these gene-gene interactions is using the concept of gene module: a group of genes with similar expression profiles across different conditions [@pmid:22955619; @pmid:25344726; @doi:10.1038/ng.3259].
18 | In this context, several unsupervised approaches have been proposed to infer these gene-gene connections by extracting gene modules from co-expression patterns [@pmid:9843981; @pmid:24662387; @pmid:16333293].
19 | Matrix factorization techniques like independent or principal component analysis (ICA/PCA) have shown superior performance in this task [@doi:10.1038/s41467-018-03424-4] since they capture local expression effects from a subset of samples and can handle module overlap effectively.
20 | Therefore, integrating genetic studies with gene modules extracted using unsupervised learning could further improve our understanding of disease origin [@pmid:25344726] and progression [@pmid:18631455].
21 |
22 |
23 | Here we propose PhenoPLIER, an omnigenic approach that provides a gene module perspective to genetic studies.
24 | The flexibility of our method allows integrating different data modalities into the same representation for a joint analysis.
25 | We show that this module perspective can infer how groups of functionally-related genes influence complex traits, detect shared and distinct transcriptomic properties among traits, and predict how pharmacological perturbations affect genes' activity to exert their effects.
26 | PhenoPLIER maps gene-trait associations and drug-induced transcriptional responses into a common latent representation.
27 | For this, we integrate thousands of gene-trait associations (using TWAS from PhenomeXcan [@doi:10.1126/sciadv.aba2083]) and transcriptional profiles of drugs (from LINCS L1000 [@doi:10.1016/j.cell.2017.10.049]) into a low-dimensional space learned from public gene expression data on tens of thousands of RNA-seq samples (recount2 [@doi:10.1016/j.cels.2019.04.003; @doi:10.1038/nbt.3838]).
28 | We use a latent representation defined by a matrix factorization approach [@doi:10.1038/s41592-019-0456-1; @doi:10.1016/j.cels.2019.04.003] that extracts gene modules with certain sparsity constraints and preferences for those that align with prior knowledge (pathways).
29 | When mapping gene-trait associations to this reduced expression space, we observe that diseases are significantly associated with gene modules expressed in relevant cell types, such as hypothyroidism with T cells, corneal endothelial cells with keratometry measurements, hematological assays on specific blood cell types, plasma lipids with adipose tissue, and neuropsychiatric disorders with different brain cell types.
30 | Moreover, since PhenoPLIER can use models derived from large and heterogeneous RNA-seq datasets, we can also identify modules associated with cell types under specific stimuli or disease states.
31 | We observe that significant module-trait associations in PhenomeXcan (our discovery cohort) replicated in the Electronic Medical Records and Genomics (eMERGE) network phase III [@doi:10.1038/gim.2013.72; @doi:10.1101/2021.10.21.21265225] (our replication cohort).
32 | Furthermore, we perform a CRISPR screen to analyze lipid regulation in HepG2 cells.
33 | We observe more robust trait associations with modules than with individual genes, even when single genes known to be involved in lipid metabolism did not reach genome-wide significance.
34 | Compared to a single-gene approach, our module-based method also better predicts FDA-approved drug-disease links by capturing tissue-specific pathophysiological mechanisms linked with the mechanism of action of drugs (e.g., niacin with cardiovascular traits via a known immune mechanism).
35 | This improved drug-disease prediction suggests that modules may provide a better means to examine drug-disease relationships than individual genes.
36 | Finally, exploring the phenotype-module space reveals stable trait clusters associated with relevant tissues, including a complex branch involving lipids with cardiovascular, autoimmune, and neuropsychiatric disorders.
37 | In summary, instead of considering single genes associated with different complex traits, PhenoPLIER incorporates groups of genes that act together to carry out different functions in specific cell types.
38 | This approach improves robustness in detecting and interpreting genetic associations, and here we show how it can prioritize alternative and potentially more promising candidate targets even when known single gene associations are not detected.
39 | The approach represents a conceptual shift in the interpretation of genetic studies.
40 | It has the potential to extract mechanistic insight from statistical associations to enhance the understanding of complex diseases and their therapeutic modalities.
41 |
--------------------------------------------------------------------------------
/tests/manuscripts/ccc/06.discussion.md:
--------------------------------------------------------------------------------
1 | ## Discussion
2 |
3 | We introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear machine learning-based statistic.
4 | Applying CCC to GTEx v8 revealed that it was robust to outliers and detected linear relationships as well as complex and biologically meaningful patterns that standard coefficients missed.
5 | In particular, CCC alone detected gene pairs with complex nonlinear patterns from the sex chromosomes, highlighting the role that not-only-linear coefficients can play in capturing sex-specific differences.
6 | The ability to capture these nonlinear patterns, however, extends beyond sex differences: it provides a powerful approach to detect complex relationships where a subset of samples or conditions are explained by other factors (such as differences between health and disease).
7 | We found that top CCC-ranked gene pairs in whole blood from GTEx were replicated in independent tissue-specific networks trained from multiple data types and attributed to cell lineages from blood, even though CCC did not have access to any cell lineage-specific information.
8 | This suggests that CCC can disentangle intricate cell lineage-specific transcriptional patterns missed by linear-only coefficients.
9 | In addition to capturing nonlinear patterns, the CCC was more similar to Spearman than Pearson, highlighting their shared robustness to outliers.
10 | The CCC results were concordant with MIC, but much faster to compute and thus practical for large datasets.
11 | Another advantage over MIC is that CCC can also process categorical variables together with numerical values.
12 | CCC is conceptually easy to interpret and has a single parameter that controls the maximum complexity of the detected relationships while also balancing compute time.
13 |
14 |
15 | Datasets such as Anscombe or "Datasaurus" highlight the value of visualization instead of relying on simple data summaries.
16 | While visual analysis is helpful, for many datasets examining each possible relationship is infeasible, and this is where more sophisticated and robust correlation coefficients are necessary.
17 | Advanced yet interpretable coefficients like CCC can focus human interpretation on patterns that are more likely to reflect real biology.
18 | The complexity of these patterns might reflect heterogeneity in samples that mask clear relationships between variables.
19 | For example, genes *UTY* - *KDM6A* (from sex chromosomes), detected by CCC, have a strong linear relationship but only in a subset of samples (males), which was not captured by linear-only coefficients.
20 | This example, in particular, highlights the importance of considering sex as a biological variable (SABV) [@doi:10.1038/509282a] to avoid overlooking important differences between men and women, for instance, in disease manifestations [@doi:10.1210/endrev/bnaa034; @doi:10.1038/s41593-021-00806-8].
21 | More generally, a not-only-linear correlation coefficient like CCC could identify significant differences between variables (such as genes) that are explained by a third factor (beyond sex differences), that would be entirely missed by linear-only coefficients.
22 |
23 |
24 | It is well-known that biomedical research is biased towards a small fraction of human genes [@pmid:17620606; @pmid:17472739].
25 | Some genes highlighted in CCC-ranked pairs (Figure @fig:upsetplot_coefs b), such as *SDS* (12q24) and *ZDHHC12* (9q34), were previously found to be the focus of fewer than expected publications [@pmid:30226837].
26 | It is possible that the widespread use of linear coefficients may bias researchers away from genes with complex co-expression patterns.
27 | A beyond-linear gene co-expression analysis on large compendia might shed light on the function of understudied genes.
28 | For example, the genes *KLHL21* (1p36) and *AC068580.6* (*ENSG00000235027*, in 11p15) have a high CCC value and are missed by the other coefficients.
29 | *KLHL21* was suggested as a potential therapeutic target for hepatocellular carcinoma [@pmid:27769251] and other cancers [@pmid:29574153; @pmid:35084622].
30 | Its nonlinear correlation with *AC068580.6* might unveil other important players in cancer initiation or progression, potentially in subsets of samples with specific characteristics (as suggested in Figure @fig:upsetplot_coefs b).
31 |
32 |
33 | Not-only-linear correlation coefficients might also be helpful in the field of genetic studies.
34 | In this context, genome-wide association studies (GWAS) have been successful in understanding the molecular basis of common diseases by estimating the association between genotype and phenotype [@doi:10.1016/j.ajhg.2017.06.005].
35 | However, the estimated effect sizes of genes identified with GWAS are generally modest, and they explain only a fraction of the phenotype variance, hampering the clinical translation of these findings [@doi:10.1038/s41576-019-0127-1].
36 | Recent theories, like the omnigenic model for complex traits [@pmid:28622505; @pmid:31051098], argue that these observations are explained by highly-interconnected gene regulatory networks, with some core genes having a more direct effect on the phenotype than others.
37 | Using this omnigenic perspective, we and others [@doi:10.1101/2021.07.05.450786; @doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.10.21.21265342] have shown that integrating gene co-expression networks in genetic studies could potentially identify core genes that are missed when relying only on linear models such as GWAS.
38 | Our results suggest that building these networks with more advanced and efficient correlation coefficients could better estimate gene co-expression profiles and thus more accurately identify these core genes.
39 | Approaches like CCC could play a significant role in the precision medicine field by providing the computational tools to focus on more promising genes representing potentially better candidate drug targets.
40 |
41 |
42 | Our analyses have some limitations.
43 | We restricted our analyses to a sample of the most variable genes to keep computation time feasible.
44 | Although CCC is much faster than MIC, Pearson and Spearman are still the most computationally efficient since they only rely on simple data statistics.
45 | Our results, however, reveal the advantages of using more advanced coefficients like CCC for detecting and studying more intricate molecular mechanisms that replicated in independent datasets.
46 | The application of CCC on larger compendia, such as recount3 [@pmid:34844637] with thousands of heterogeneous samples across different conditions, can reveal other potentially meaningful gene interactions.
47 | The single parameter of CCC, $k_{\mathrm{max}}$, controls the maximum complexity of patterns found and also impacts the compute time.
48 | Our analysis suggested that $k_{\mathrm{max}}=10$ was sufficient to identify both linear and more complex patterns in gene expression.
49 | A more comprehensive analysis of optimal values for this parameter could provide insights to adjust it for different applications or data types.
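
The trade-off governed by $k_{\mathrm{max}}$ can be made concrete with a simplified sketch. It assumes that the coefficient is obtained by partitioning each variable into quantile-based clusters for every number of clusters from 2 to $k_{\mathrm{max}}$ and taking the maximum adjusted Rand index (ARI) over all pairs of partitions. The function names (`quantile_partition`, `adjusted_rand_index`, `toy_ccc`) are hypothetical and are not the API of the CCC package; the code is meant for illustration, not as an efficient or complete implementation.

```python
import numpy as np


def quantile_partition(x, k):
    """Assign each value of x to one of k roughly equal-sized quantile bins."""
    inner_edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(inner_edges, x, side="right")


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two non-negative integer label vectors."""
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(table, (a, b), 1)  # contingency table of label co-occurrences

    def pairs(m):
        # Number of unordered pairs that can be drawn from m objects.
        return m * (m - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(a.size)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        return 0.0  # degenerate partitions; treated as no association here
    return float((sum_cells - expected) / (maximum - expected))


def toy_ccc(x, y, k_max=10):
    """Maximum ARI over all pairs of quantile partitions with 2..k_max bins."""
    best = 0.0  # negative ARI values are clipped to zero in this sketch
    for kx in range(2, k_max + 1):
        part_x = quantile_partition(x, kx)
        for ky in range(2, k_max + 1):
            part_y = quantile_partition(y, ky)
            best = max(best, adjusted_rand_index(part_x, part_y))
    return best


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
y = x**2 + rng.normal(0.0, 0.05, size=1000)  # synthetic quadratic pattern

ccc_k2 = toy_ccc(x, y, k_max=2)
ccc_k10 = toy_ccc(x, y, k_max=10)
print(f"k_max=2:  {ccc_k2:.2f}")
print(f"k_max=10: {ccc_k10:.2f}")
```

In this sketch the number of partition comparisons grows as $(k_{\mathrm{max}} - 1)^2$, which is why the parameter affects compute time. A median split of each variable (the only comparison available with $k_{\mathrm{max}}=2$) cannot capture the symmetric quadratic pattern, whereas finer partitions can, so the value returned never decreases as $k_{\mathrm{max}}$ grows.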
50 |
51 |
52 | While linear and rank-based correlation coefficients are exceptionally fast to calculate, not all relevant patterns in biological datasets are linear.
53 | For example, patterns associated with sex as a biological variable are not apparent to the linear-only coefficients that we evaluated but are revealed by not-only-linear methods.
54 | Beyond sex differences, being able to use a method that inherently identifies patterns driven by other factors is likely to be desirable.
55 | Not-only-linear coefficients can also disentangle intricate yet relevant patterns from expression data alone that were replicated in models integrating different data modalities.
56 | CCC, in particular, is highly parallelizable, and we anticipate efficient GPU-based implementations that could make it even faster.
57 | The CCC is an efficient, next-generation correlation coefficient that is highly effective in transcriptome analyses and potentially useful in a broad range of other domains.
58 |
--------------------------------------------------------------------------------