├── .github └── workflows │ └── pythonpublish.yml ├── .gitignore ├── LICENSE ├── MANIFEST.in ├── README.md ├── VERSION ├── examples ├── DQ2_2024_P08758.png ├── custom_list_of_proteins.tsv └── samples.tsv ├── ms1searchpy ├── __init__.py ├── combine.py ├── combine_proteins.py ├── directms1quant.py ├── directms1quantmulti.py ├── group_specific.py ├── main.py ├── models │ └── CSD_model_LCMSMS.hdf5 ├── ms1todiffacto.py ├── search.py ├── utils.py └── utils_figures.py ├── requirements.txt ├── setup.py └── test.sh /.github/workflows/pythonpublish.yml: -------------------------------------------------------------------------------- 1 | name: Upload Python Package 2 | 3 | on: 4 | release: 5 | types: [published] 6 | 7 | jobs: 8 | deploy: 9 | runs-on: ubuntu-latest 10 | steps: 11 | - uses: actions/checkout@v2 12 | - name: Set up Python 13 | uses: actions/setup-python@v2 14 | with: 15 | python-version: '3.x' 16 | - name: Install dependencies 17 | run: | 18 | python -m pip install --upgrade pip 19 | pip install setuptools wheel twine 20 | - name: Build and publish 21 | env: 22 | TWINE_USERNAME: '__token__' 23 | TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }} 24 | run: | 25 | python setup.py sdist bdist_wheel 26 | twine upload dist/* 27 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.egg-info 2 | **/__pycache__ 3 | test_data* 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2021 Mark Ivanov 2 | 3 | Licensed under the Apache License, Version 2.0 (the "License"); 4 | you may not use this file except in compliance with the License. 5 | You may obtain a copy of the License at 6 | 7 | http://www.apache.org/licenses/LICENSE-2.0 8 | 9 | Unless required by applicable law or agreed to in writing, software 10 | distributed under the License is distributed on an "AS IS" BASIS, 11 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | See the License for the specific language governing permissions and 13 | limitations under the License. -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include requirements.txt -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ms1searchpy - a DirectMS1 proteomics search engine for LC-MS1 spectra 2 | 3 | `ms1searchpy` consumes LC-MS data (**mzML**) or peptide features (**tsv**) and performs protein identification and quantitation. It is recommended to run the peptide cluster detection (biosaur2) separately so that the user can control the cluster search parameters. In addition, this will eliminate unnecessary calculations when reprocessing files. 4 | 5 | ## Basic usage 6 | 7 | Basic command for protein identification: 8 | 9 | biosaur2 *.mzML 10 | ms1searchpy *features.tsv -d path_to.FASTA 11 | 12 | or 13 | 14 | ms1searchpy *.mzML -d path_to.FASTA 15 | 16 | Read further for detailed info, including quantitative analysis. 17 | 18 | ## Citing ms1searchpy 19 | 20 | Ivanov et al. DirectMS1Quant: Ultrafast Quantitative Proteomics with MS/MS-Free Mass Spectrometry. https://pubs.acs.org/doi/10.1021/acs.analchem.2c02255 21 | 22 | Ivanov et al. 
Boosting MS1-only Proteomics with Machine Learning Allows 2000 Protein Identifications in Single-Shot Human Proteome Analysis Using 5 min HPLC Gradient. https://doi.org/10.1021/acs.jproteome.0c00863
23 | 
24 | Ivanov et al. DirectMS1: MS/MS-free identification of 1000 proteins of cellular proteomes in 5 minutes. https://doi.org/10.1021/acs.analchem.9b05095
25 | 
26 | ## Installation
27 | 
28 | It is recommended to additionally install [DeepLC](https://github.com/compomics/DeepLC) version 1.1.2.2 (an unofficial fork with small changes); newer versions currently have some issues. It is very important to install DeepLC before ms1searchpy for compatibility of its outdated dependencies!
29 | 
30 |     pip install https://github.com/markmipt/DeepLC/archive/refs/heads/alternative_best_model.zip
31 | 
32 | After that, install ms1searchpy:
33 | 
34 |     pip install ms1searchpy
35 | 
36 | This should work on recent versions of Python (3.8-3.10).
37 | 
38 | ## Usage tutorial: protein identification
39 | 
40 | The script used for protein identification is called `ms1searchpy`. It needs input files (mzML or tsv) and a FASTA database.
41 | 
42 | ### Input files
43 | 
44 | If mzML files are provided, ms1searchpy will invoke [biosaur2](https://github.com/markmipt/biosaur2) to generate the features table.
45 | You can also use other software like [Dinosaur](https://github.com/fickludd/dinosaur) or [Biosaur](https://github.com/abdrakhimov1/Biosaur),
46 | but [biosaur2](https://github.com/markmipt/biosaur2) is recommended. You can also produce the table yourself;
47 | it must contain the columns 'massCalib', 'rtApex', 'charge' and 'nIsotopes'.
48 | 
49 | #### How to get mzML files
50 | 
51 | To get mzML from RAW files, you can use [Proteowizard MSConvert](https://proteowizard.sourceforge.io/download.html)...
52 | 
53 |     msconvert path_to_file.raw -o path_to_output_folder --mzML --filter "peakPicking true 1-" --filter "MS2Deisotope" --filter "zeroSamples removeExtra" --filter "threshold absolute 1 most-intense"
54 | 
55 | ...or [compomics ThermoRawFileParser](https://github.com/compomics/ThermoRawFileParser), which produces suitable files
56 | with default parameters.
57 | 
58 | ### RT predictor
59 | 
60 | For protein identification, `ms1searchpy` needs a retention time prediction model. The recommended one is [DeepLC](https://github.com/compomics/DeepLC),
61 | but you can also use the built-in additive model (the default).
62 | 
63 | ### Examples
64 | 
65 |     biosaur2 test.mzML -minlh 3
66 |     ms1searchpy test.features.tsv -d sprot_human.fasta -deeplc 1 -ad 1
67 | 
68 | The first command runs `biosaur2` to detect all peptide isotopic clusters that are visible in at least 3 consecutive MS1 scans. The second command runs `ms1searchpy` with the DeepLC RT predictor (this should work if you installed DeepLC
69 | alongside `ms1searchpy`). `-ad 1` creates a shuffled decoy database for FDR estimation.
70 | You should create it only once and reuse the created database for subsequent searches.
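If you prefer to prepare the target-decoy database yourself, you can do it with [pyteomics](https://pyteomics.readthedocs.io), which is already a `ms1searchpy` dependency. The sketch below is only an approximation of what `-ad 1` does, not the exact internal code; the file names are placeholders:

    # a minimal sketch: build a shuffled target-decoy database with pyteomics
    from pyteomics import fasta

    # writes the target entries plus shuffled decoys marked with the DECOY_
    # prefix, which is the prefix ms1searchpy expects by default
    fasta.write_decoy_db('sprot_human.fasta', 'sprot_human_shuffled.fasta',
                         mode='shuffle', prefix='DECOY_')

The resulting FASTA file can then be passed to `-d` in all subsequent searches.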
71 | 
72 | ### Output files
73 | 
74 | `ms1searchpy` produces several tables:
75 | - identified proteins, FDR-filtered (`sample.features_proteins.tsv`) - this is the main result;
76 | - all identified proteins (`sample.features_proteins_full.tsv`);
77 | - all identified proteins based on all PFMs (`sample.features_proteins_full_noexclusion.tsv`);
78 | - all peptide-feature matches, or PFMs (`sample.features_PFMs.tsv`);
79 | - all PFMs with features prepared for machine learning (`sample.features_PFMs_ML.tsv`);
80 | - the number of theoretical peptides per protein (`sample.features_protsN.tsv`);
81 | - a log file with estimated mass and RT accuracies (`sample.features_log.txt`).
82 | 
83 | ### Combine results from replicates
84 | 
85 | You can combine the results from several replicate runs with `ms1combine` by feeding it the `_PFMs_ML.tsv` tables:
86 | 
87 |     ms1combine sample_rep_*.features_PFMs_ML.tsv
88 | 
89 | ### Using Group-specific FDR for metaproteomics
90 | 
91 | Group-specific FDR should be used in metaproteomics for accurate estimation of the number of proteins identified in each group. Use the `ms1groups` command for that:
92 | 
93 |     ms1groups F04.features_PFMs_ML.tsv -d F04_top15_shuffled.fasta -out group_statistics_by -fdr 5.0 -groups genus
94 | 
95 | It produces a table with the number of identified proteins for each group using group-specific FDR. This is essentially equivalent to running multiple DirectMS1 searches against small protein databases, each containing a single group, and combining the results. However, using the `ms1groups` script on top of a preliminary DirectMS1 search solves two problems: the poor statistics available to the mass/RT/machine-learning calibration procedures of the DirectMS1 workflow for low-populated groups, and the computational time. Currently supported groups are 'species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom' and 'domain'. The groups are extracted automatically using the ete3 Python module and the NCBI Taxonomy database. The script also supports the groups dbname and OX: the former is the taxonomy suffix of a Swiss-Prot protein name (_HUMAN, _YEAST, etc.), and the latter is the taxonomy provided by 'OX=' in the protein descriptions of the FASTA file.
96 | 
97 | ## Usage tutorial: Quantitation
98 | 
99 | After obtaining the protein identification results, you can proceed to compare your samples using LFQ.
100 | 
101 | ### Using directms1quant
102 | 
103 | The LFQ method designed specifically for DirectMS1, DirectMS1Quant, is invoked like this:
104 | 
105 |     directms1quant -S1 sample1_r{1,2,3}.features_proteins_full.tsv -S2 sample2_r{1,2,3}.features_proteins_full.tsv
106 | 
107 | It produces a filtered table of significantly changed proteins with p-values and fold changes,
108 | as well as the full protein table and a separate file simply listing all
109 | IDs of significantly changed proteins (e.g. for easy copy-paste into a StringDB search window).
110 | 
111 | 
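The filtered table essentially contains the rows of the full table that pass both the fold-change and the BH-corrected significance filters, so you can re-filter the full table with your own criteria. A minimal pandas sketch (column and file names as produced by the current version with the default `-out` prefix):

    # a minimal sketch: re-filter the full directms1quant protein table
    import pandas as pd

    df = pd.read_table('directms1quant_out_quant_full.tsv')
    # BH_pass marks proteins passing the BH-corrected q-value threshold,
    # FC_pass marks proteins passing the fold-change threshold
    significant = df[df['BH_pass'] & df['FC_pass']]
    print(significant[['dbname', 'gene', 'log2FoldChange(S2/S1)', 'p-value']])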
112 | ### Multi-condition protein profiling using directms1quantmulti
113 | 
114 | You can quantify complex projects using the `directms1quantmulti` script. The example below comes from our project on time-series profiling of a glioblastoma cell line under interferon treatment.
115 | 
116 | The script takes a tab-separated table (.tsv) with details of all project files. An example sample table is available in the examples folder. It should contain the following columns (a quick validation sketch is shown below, after the stage descriptions):
117 | 
118 | File Name - the name of the raw file. For example, “QEHFX_JB_000379”.
119 | 
120 | group - the sample group of the file. In our example, there are K (control), IFN30 (treatment with 30 units/ml of interferon) and IFN1000 groups. The first group mentioned in the table is used as the control in the pairwise directms1quant runs.
121 | 
122 | condition - the sample subgroup of the file. In our example, there are multiple time points after treatment, such as 0h, 30min, 1h, 2h, etc. By default, only matching conditions are used for pairwise comparisons: for example, IFN30 0h vs K 0h, IFN1000 0h vs K 0h, etc.
123 | 
124 | vs - a column for requesting a specific condition comparison. For example, in our case we did not have control samples at the 30 min time point, so we wanted the directms1quant runs to compare IFN30 30 min vs K 0h and IFN1000 30 min vs K 0h. To achieve this, we put “0h” in the “vs” column of the 30 min IFN30 and IFN1000 files. See the example table for details.
125 | 
126 | replicate - the replicate number within a specific condition and sample group.
127 | 
128 | BatchMS - the mass spectrometry batch. This parameter is used for extra normalization within each batch.
129 | 
130 | 
131 | The script consists of four stages, and you can rerun it without repeating stages that have already been done (the “-start_stage” option).
132 | 
133 | Stage 1 is a set of pairwise DirectMS1Quant runs for the different interferon treatment conditions versus the control samples.
134 | 
135 | Stage 2 is the preparation of the peptide LFQ table for all files using the results obtained in the previous stage.
136 | 
137 | Stage 3 is the preparation of the protein LFQ table. Only peptides labeled by DirectMS1Quant as significantly different between samples in at least X pairwise comparisons are used for protein quantitation; X is controlled by the “min_signif_for_pept” option.
138 | 
139 | Stage 4 is the preparation of LFQ profiling figures for the proteins listed in the file given by the “proteins_for_figure” option. The file should be a tsv table with a “dbname” column containing protein database names in Swiss-Prot format. Any default directms1quant output table with differentially expressed proteins can be used here.
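Before starting a long run, it can be useful to sanity-check the sample table. Below is a minimal pandas sketch (not part of `ms1searchpy`; it only assumes the column names described above):

    # a minimal sketch: sanity-check the samples table
    import pandas as pd

    samples = pd.read_table('samples.tsv')
    # 'File Name' and 'group' are required; 'condition', 'replicate',
    # 'BatchMS' and 'vs' are filled with defaults when absent
    assert {'File Name', 'group'}.issubset(samples.columns)
    print('control group:', samples['group'].iloc[0])  # first group = control
    print(samples.groupby(['group', 'condition']).size())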
140 | 141 | 142 | Example of script usage:: 143 | 144 | directms1quantmulti -db ~/fasta_folder/sprot_human_shuffled.fasta -pdir ~/folder_with_ms1searchpy_results/ -samples ~/samples.csv -min_signif_for_pept 2 -out DQmulti_2024 -pep_min_non_missing_samples 0.75 -start_stage 1 -proteins_for_figure ~/custom_list_of_proteins.tsv -figdir ~/output_figure_folder/ 145 | 146 | ## Links 147 | 148 | - GitHub repo & issue tracker: https://github.com/markmipt/ms1searchpy 149 | - Mailing list: markmipt@gmail.com 150 | 151 | - DeepLC repo: https://github.com/compomics/DeepLC 152 | -------------------------------------------------------------------------------- /VERSION: -------------------------------------------------------------------------------- 1 | 2.8.12 2 | -------------------------------------------------------------------------------- /examples/DQ2_2024_P08758.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/markmipt/ms1searchpy/ab7dd1aba1a513f71d028263972fd5e4c79e7090/examples/DQ2_2024_P08758.png -------------------------------------------------------------------------------- /examples/custom_list_of_proteins.tsv: -------------------------------------------------------------------------------- 1 | dbname 2 | sp|P08758|ANXA5_HUMAN 3 | sp|P04083|ANXA1_HUMAN 4 | sp|P07355|ANXA2_HUMAN 5 | sp|P08133|ANXA6_HUMAN 6 | -------------------------------------------------------------------------------- /examples/samples.tsv: -------------------------------------------------------------------------------- 1 | File Name group condition replicate BatchMS vs 2 | QEHFX_JB_000379 K 0h 1 1 3 | QEHFX_JB_000380 K 0h 2 1 4 | QEHFX_JB_000381 K 0h 3 1 5 | QEHFX_JB_000382 K 0h 4 1 6 | QEHFX_JB_000383 K 0h 5 1 7 | QEHFX_JB_000384 K 24h 1 1 8 | QEHFX_JB_000385 K 24h 2 1 9 | QEHFX_JB_000386 K 24h 3 1 10 | QEHFX_JB_000387 K 24h 4 1 11 | QEHFX_JB_000388 K 24h 5 1 12 | QEHFX_JB_000389 K 12h 1 1 13 | QEHFX_JB_000390 K 12h 2 1 14 | QEHFX_JB_000391 K 12h 3 1 15 | QEHFX_JB_000392 K 12h 4 1 16 | QEHFX_JB_000393 K 12h 5 1 17 | QEHFX_JB_000394 K 2h 1 1 18 | QEHFX_JB_000395 K 2h 2 1 19 | QEHFX_JB_000396 K 2h 3 1 20 | QEHFX_JB_000397 K 2h 4 1 21 | QEHFX_JB_000398 K 2h 5 1 22 | QEHFX_JB_000399 IFN30 24h 1 1 23 | QEHFX_JB_000400 IFN30 24h 2 1 24 | QEHFX_JB_000401 IFN30 24h 3 1 25 | QEHFX_JB_000402 IFN30 24h 4 1 26 | QEHFX_JB_000403 IFN30 24h 5 1 27 | QEHFX_JB_000404 IFN30 16h 1 1 12h 28 | QEHFX_JB_000405 IFN30 16h 2 1 12h 29 | QEHFX_JB_000406 IFN30 16h 3 1 12h 30 | QEHFX_JB_000407 IFN30 16h 4 1 12h 31 | QEHFX_JB_000408 IFN30 16h 5 1 12h 32 | QEHFX_JB_000409 IFN30 20h 1 1 24h 33 | QEHFX_JB_000410 IFN30 20h 2 1 24h 34 | QEHFX_JB_000411 IFN30 20h 3 1 24h 35 | QEHFX_JB_000412 IFN30 20h 4 1 24h 36 | QEHFX_JB_000413 IFN30 20h 5 1 24h 37 | QEHFX_JB_000414 IFN30 4h 1 1 2h 38 | QEHFX_JB_000415 IFN30 4h 2 1 2h 39 | QEHFX_JB_000416 IFN30 4h 3 1 2h 40 | QEHFX_JB_000417 IFN30 4h 4 1 2h 41 | QEHFX_JB_000418 IFN30 4h 5 1 2h 42 | QEHFX_JB_000419 IFN30 7h 1 1 12h 43 | QEHFX_JB_000420 IFN30 7h 2 1 12h 44 | QEHFX_JB_000421 IFN30 7h 3 1 12h 45 | QEHFX_JB_000422 IFN30 7h 4 1 12h 46 | QEHFX_JB_000423 IFN30 7h 5 1 12h 47 | QEHFX_JB_000424 IFN30 30min 1 1 0h 48 | QEHFX_JB_000425 IFN30 30min 2 1 0h 49 | QEHFX_JB_000426 IFN30 30min 3 1 0h 50 | QEHFX_JB_000427 IFN30 30min 4 1 0h 51 | QEHFX_JB_000428 IFN30 30min 5 1 0h 52 | QEHFX_JB_000429 IFN30 1h 1 1 0h 53 | QEHFX_JB_000430 IFN30 1h 2 1 0h 54 | QEHFX_JB_000431 IFN30 1h 3 1 0h 55 | QEHFX_JB_000432 IFN30 1h 4 1 0h 56 | QEHFX_JB_000433 IFN30 1h 
5 1 0h 57 | QEHFX_JB_000434 IFN30 3h 1 1 2h 58 | QEHFX_JB_000435 IFN30 3h 2 1 2h 59 | QEHFX_JB_000436 IFN30 3h 3 1 2h 60 | QEHFX_JB_000437 IFN30 3h 4 1 2h 61 | QEHFX_JB_000442 IFN30 0h 1 1 2h 62 | QEHFX_JB_000443 IFN30 0h 2 1 63 | QEHFX_JB_000444 IFN30 0h 3 1 64 | QEHFX_JB_000445 IFN30 0h 4 1 65 | QEHFX_JB_000446 IFN30 0h 5 1 66 | QEHFX_JB_000447 IFN30 2h 1 1 2h 67 | QEHFX_JB_000448 IFN30 2h 2 1 2h 68 | QEHFX_JB_000449 IFN30 2h 3 1 2h 69 | QEHFX_JB_000450 IFN30 2h 4 1 2h 70 | QEHFX_JB_000451 IFN30 2h 5 1 2h 71 | QEHFX_JB_000452 IFN30 12h 1 1 12h 72 | QEHFX_JB_000453 IFN30 12h 2 1 12h 73 | QEHFX_JB_000454 IFN30 12h 3 1 12h 74 | QEHFX_JB_000455 IFN30 12h 4 1 12h 75 | QEHFX_JB_000456 IFN30 12h 5 1 12h 76 | QEHFX_JB_000457 IFN1000 16h 1 1 12h 77 | QEHFX_JB_000458 IFN1000 16h 2 1 12h 78 | QEHFX_JB_000459 IFN1000 16h 3 1 12h 79 | QEHFX_JB_000460 IFN1000 16h 4 1 12h 80 | QEHFX_JB_000461 IFN1000 16h 5 1 12h 81 | QEHFX_JB_000467 IFN1000 20h 1 1 24h 82 | QEHFX_JB_000468 IFN1000 20h 2 1 24h 83 | QEHFX_JB_000469 IFN1000 20h 3 1 24h 84 | QEHFX_JB_000470 IFN1000 20h 4 1 24h 85 | QEHFX_JB_000471 IFN1000 20h 5 1 24h 86 | QEHFX_JB_000472 IFN1000 24h 1 1 24h 87 | QEHFX_JB_000473 IFN1000 24h 2 1 24h 88 | QEHFX_JB_000474 IFN1000 24h 3 1 24h 89 | QEHFX_JB_000475 IFN1000 24h 4 1 24h 90 | QEHFX_JB_000476 IFN1000 24h 5 1 24h 91 | QEHFX_JB_000477 IFN1000 12h 1 1 12h 92 | QEHFX_JB_000478 IFN1000 12h 2 1 12h 93 | QEHFX_JB_000479 IFN1000 12h 3 1 12h 94 | QEHFX_JB_000480 IFN1000 12h 4 1 12h 95 | QEHFX_JB_000481 IFN1000 12h 5 1 12h 96 | QEHFX_JB_000482 IFN1000 0h 1 1 0h 97 | QEHFX_JB_000483 IFN1000 0h 2 1 0h 98 | QEHFX_JB_000484 IFN1000 0h 3 1 0h 99 | QEHFX_JB_000485 IFN1000 0h 4 1 0h 100 | QEHFX_JB_000486 IFN1000 0h 5 1 0h 101 | QEHFX_JB_000487 IFN1000 3h 1 1 2h 102 | QEHFX_JB_000488 IFN1000 3h 2 1 2h 103 | QEHFX_JB_000489 IFN1000 3h 3 1 2h 104 | QEHFX_JB_000490 IFN1000 3h 4 1 2h 105 | QEHFX_JB_000491 IFN1000 4h 1 1 2h 106 | QEHFX_JB_000492 IFN1000 4h 2 1 2h 107 | QEHFX_JB_000493 IFN1000 4h 3 1 2h 108 | QEHFX_JB_000494 IFN1000 4h 4 1 2h 109 | QEHFX_JB_000495 IFN1000 4h 5 1 2h 110 | QEHFX_JB_000496 IFN1000 2h 1 1 2h 111 | QEHFX_JB_000497 IFN1000 2h 2 1 2h 112 | QEHFX_JB_000498 IFN1000 2h 3 1 2h 113 | QEHFX_JB_000499 IFN1000 2h 4 1 2h 114 | QEHFX_JB_000500 IFN1000 2h 5 1 2h 115 | QEHFX_JB_000501 IFN1000 1h 1 1 0h 116 | QEHFX_JB_000502 IFN1000 1h 2 1 0h 117 | QEHFX_JB_000503 IFN1000 1h 3 1 0h 118 | QEHFX_JB_000504 IFN1000 1h 4 1 0h 119 | QEHFX_JB_000505 IFN1000 1h 5 1 0h 120 | QEHFX_JB_000506 IFN1000 7h 1 1 12h 121 | QEHFX_JB_000507 IFN1000 7h 2 1 12h 122 | QEHFX_JB_000508 IFN1000 7h 3 1 12h 123 | QEHFX_JB_000509 IFN1000 7h 4 1 12h 124 | QEHFX_JB_000510 IFN1000 7h 5 1 12h 125 | QEHFX_JB_000511 IFN1000 30min 1 1 0h 126 | QEHFX_JB_000512 IFN1000 30min 2 1 0h 127 | QEHFX_JB_000513 IFN1000 30min 3 1 0h 128 | QEHFX_JB_000514 IFN1000 30min 4 1 0h 129 | QEHFX_JB_000515 IFN1000 30min 5 1 0h 130 | -------------------------------------------------------------------------------- /ms1searchpy/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/markmipt/ms1searchpy/ab7dd1aba1a513f71d028263972fd5e4c79e7090/ms1searchpy/__init__.py -------------------------------------------------------------------------------- /ms1searchpy/combine.py: -------------------------------------------------------------------------------- 1 | from .main import final_iteration, filter_results 2 | import pandas as pd 3 | from collections import defaultdict 4 | import argparse 5 | import 
logging 6 | 7 | logger = logging.getLogger(__name__) 8 | 9 | def run(): 10 | parser = argparse.ArgumentParser( 11 | description='Combine DirectMS1 search results', 12 | epilog=''' 13 | 14 | Example usage 15 | ------------- 16 | $ ms1combine.py file1_PFMs_ML.tsv ... filen_PFMs_ML.tsv 17 | ------------- 18 | ''', 19 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 20 | 21 | parser.add_argument('file', nargs='+', help='input tsv PFMs_ML files') 22 | parser.add_argument('-out', help='prefix output file names', default='combined') 23 | parser.add_argument('-prots_full', help='path to any of *_proteins_full.tsv file. By default this file will be searched in the folder with PFMs_ML files', default='') 24 | parser.add_argument('-fdr', help='protein fdr filter in %%', default=1.0, type=float) 25 | parser.add_argument('-prefix', help='decoy prefix', default='DECOY_') 26 | parser.add_argument('-nproc', help='number of processes', default=1, type=int) 27 | parser.add_argument('-pp', help='protein priority table for keeping protein groups when merge results by scoring', default='') 28 | args = vars(parser.parse_args()) 29 | logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s', 30 | datefmt='[%H:%M:%S]', level=logging.INFO) 31 | process_files(args) 32 | 33 | 34 | def process_files(args): 35 | d_tmp = dict() 36 | 37 | df1 = None 38 | for idx, filen in enumerate(args['file']): 39 | logger.info('Reading file %s' % (filen, )) 40 | df3 = pd.read_csv(filen, sep='\t', usecols=['ids', 'qpreds', 'preds', 'decoy', 'seqs', 'proteins', 'peptide', 'iorig']) 41 | df3['ids'] = df3['ids'].apply(lambda x: '%d:%s' % (idx, str(x))) 42 | df3['fidx'] = idx 43 | 44 | df3 = df3[df3['qpreds'] <= 10] 45 | 46 | 47 | qval_ok = 0 48 | for qval_cur in range(10): 49 | if qval_cur != 10: 50 | df1ut = df3[df3['qpreds'] == qval_cur] 51 | decoy_ratio = df1ut['decoy'].sum() / len(df1ut) 52 | d_tmp[(idx, qval_cur)] = decoy_ratio 53 | # print(filen, qval_cur, decoy_ratio) 54 | 55 | if df1 is None: 56 | df1 = df3 57 | if args['prots_full']: 58 | df2 = pd.read_csv(args['prots_full'], sep='\t') 59 | else: 60 | try: 61 | df2 = pd.read_csv(filen.replace('_PFMs_ML.tsv', '_proteins_full.tsv'), sep='\t') 62 | except: 63 | logging.critical('Proteins_full file is missing!') 64 | break 65 | 66 | else: 67 | df1 = pd.concat([df1, df3], ignore_index=True) 68 | 69 | d_tmp = [z[0] for z in sorted(d_tmp.items(), key=lambda x: x[1])] 70 | qdict = {} 71 | for idx, val in enumerate(d_tmp): 72 | qdict[val] = int(idx / len(args['file'])) 73 | df1['qpreds'] = df1.apply(lambda x: qdict[(x['fidx'], x['qpreds'])], axis=1) 74 | 75 | 76 | pept_prot = defaultdict(set) 77 | for seq, prots in df1[['seqs', 'proteins']].values: 78 | for dbname in prots.split(';'): 79 | pept_prot[seq].add(dbname) 80 | 81 | protsN = dict() 82 | for dbname, theorpept in df2[['dbname', 'theoretical peptides']].values: 83 | protsN[dbname] = theorpept 84 | 85 | 86 | prefix = args['prefix'] 87 | isdecoy = lambda x: x[0].startswith(prefix) 88 | isdecoy_key = lambda x: x.startswith(prefix) 89 | escore = lambda x: -x[1] 90 | fdr = float(args['fdr']) / 100 91 | 92 | resdict = dict() 93 | 94 | resdict['qpreds'] = df1['qpreds'].values 95 | resdict['preds'] = df1['preds'].values 96 | resdict['seqs'] = df1['peptide'].values 97 | resdict['ids'] = df1['ids'].values 98 | resdict['iorig'] = df1['iorig'].values 99 | 100 | # mass_diff = resdict['qpreds'] 101 | # rt_diff = resdict['qpreds'] 102 | 103 | base_out_name = args['out'] 104 | 105 | e_ind = resdict['qpreds'] <= 9 106 | resdict 
= filter_results(resdict, e_ind) 107 | 108 | mass_diff = resdict['qpreds'] 109 | rt_diff = resdict['qpreds'] 110 | 111 | if args['pp']: 112 | df4 = pd.read_table(args['pp']) 113 | prots_spc_basic2 = df4.set_index('dbname')['score'].to_dict() 114 | else: 115 | prots_spc_basic2 = False 116 | 117 | final_iteration(resdict, mass_diff, rt_diff, pept_prot, protsN, base_out_name, prefix, isdecoy, isdecoy_key, escore, fdr, args['nproc'], prots_spc_basic2=prots_spc_basic2) 118 | 119 | 120 | if __name__ == '__main__': 121 | run() 122 | -------------------------------------------------------------------------------- /ms1searchpy/combine_proteins.py: -------------------------------------------------------------------------------- 1 | from pyteomics import auxiliary as aux 2 | import pandas as pd 3 | import argparse 4 | import logging 5 | 6 | logger = logging.getLogger(__name__) 7 | 8 | def run(): 9 | parser = argparse.ArgumentParser( 10 | description='Combine DirectMS1 search results', 11 | epilog=''' 12 | 13 | Example usage 14 | ------------- 15 | $ ms1combine_proteins file1_proteins_full.tsv ... filen_proteins_full.tsv 16 | ------------- 17 | ''', 18 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 19 | 20 | parser.add_argument('file', nargs='+', help='input tsv proteins_full files') 21 | parser.add_argument('-out', help='prefix for joint file name', default='combined') 22 | parser.add_argument('-fdr', help='protein fdr filter in %%', default=1.0, type=float) 23 | parser.add_argument('-prefix', help='decoy prefix', default='DECOY_') 24 | args = vars(parser.parse_args()) 25 | logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s', 26 | datefmt='[%H:%M:%S]', level=logging.INFO) 27 | 28 | tmp_list = [] 29 | for idx, filen in enumerate(args['file']): 30 | tmp_list.append(pd.read_csv(filen, sep='\t')) 31 | 32 | df1 = pd.concat(tmp_list).groupby(['dbname']).sum().reset_index() 33 | out_name = '%s.features_proteins.tsv' % (args['out'], ) 34 | 35 | 36 | 37 | 38 | prots_spc = df1.set_index('dbname').score.to_dict() 39 | fdr = args['fdr'] / 100 40 | 41 | prefix = args['prefix'] 42 | isdecoy = lambda x: x[0].startswith(prefix) 43 | isdecoy_key = lambda x: x.startswith(prefix) 44 | escore = lambda x: -x[1] 45 | 46 | 47 | 48 | checked = set() 49 | for k, v in list(prots_spc.items()): 50 | if k not in checked: 51 | if isdecoy_key(k): 52 | if prots_spc.get(k.replace(prefix, ''), -1e6) > v: 53 | del prots_spc[k] 54 | checked.add(k.replace(prefix, '')) 55 | else: 56 | if prots_spc.get(prefix + k, -1e6) > v: 57 | del prots_spc[k] 58 | checked.add(prefix + k) 59 | 60 | filtered_prots = aux.filter(prots_spc.items(), fdr=fdr, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1, full_output=True, correction=1) 61 | if len(filtered_prots) < 1: 62 | filtered_prots = aux.filter(prots_spc.items(), fdr=fdr, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1, full_output=True, correction=0) 63 | identified_proteins = 0 64 | 65 | for x in filtered_prots: 66 | identified_proteins += 1 67 | 68 | logger.info('TOP 5 identified proteins:') 69 | logger.info('dbname\tscore') 70 | for x in filtered_prots[:5]: 71 | logger.info('\t'.join((str(x[0]), str(x[1])))) 72 | logger.info('Final joint search: identified proteins = %d', len(filtered_prots)) 73 | 74 | df1 = pd.DataFrame.from_records(filtered_prots, columns=['dbname', 'score']) 75 | df1.to_csv(out_name, index=False, sep='\t') 76 | 77 | if __name__ == '__main__': 78 | run() 79 | 
-------------------------------------------------------------------------------- /ms1searchpy/directms1quant.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import argparse 3 | import pandas as pd 4 | import numpy as np 5 | from scipy.stats import binom, ttest_ind 6 | from scipy.optimize import curve_fit 7 | import logging 8 | from pyteomics import fasta 9 | from collections import Counter, defaultdict 10 | import random 11 | random.seed(42) 12 | 13 | logger = logging.getLogger(__name__) 14 | 15 | 16 | def get_df_final(args, replace_label, allowed_peptides, allowed_prots_all, pep_RT=False, RT_threshold=False): 17 | 18 | df_final = False 19 | for i in range(1, 3, 1): 20 | sample_num = 'S%d' % (i, ) 21 | if args.get(sample_num, 0): 22 | for z in args[sample_num]: 23 | label = sample_num + '_' + z.replace(replace_label, '') 24 | df3 = pd.read_table(z.replace(replace_label, '_PFMs.tsv'), usecols=['sequence', 'proteins', 'charge', 'ion_mobility', 'Intensity', 'RT']) 25 | 26 | 27 | if not args['allowed_peptides']: 28 | df3['tmpseq'] = df3['sequence'] 29 | df3 = df3[df3['tmpseq'].apply(lambda x: x in allowed_peptides)] 30 | else: 31 | df3 = df3[df3['sequence'].apply(lambda x: x in allowed_peptides)] 32 | 33 | 34 | df3 = df3[df3['proteins'].apply(lambda x: any(z in allowed_prots_all for z in x.split(';')))] 35 | df3['proteins'] = df3['proteins'].apply(lambda x: ';'.join([z for z in x.split(';') if z in allowed_prots_all])) 36 | 37 | df3['origseq'] = df3['sequence'] 38 | df3['sequence'] = df3['sequence'] + df3['charge'].astype(int).astype(str) + df3['ion_mobility'].astype(str) 39 | 40 | if pep_RT is not False: 41 | df3 = df3[df3.apply(lambda x: abs(pep_RT[x['sequence']] - x['RT']) <= 3 * RT_threshold, axis=1)] 42 | 43 | df3 = df3.sort_values(by='Intensity', ascending=False) 44 | 45 | df3 = df3.drop_duplicates(subset='sequence') 46 | 47 | df3[label] = df3['Intensity'] 48 | df3['protein'] = df3['proteins'] 49 | df3['peptide'] = df3['sequence'] 50 | if pep_RT is False: 51 | df3['RT_'+label] = df3['RT'] 52 | df3 = df3[['origseq', 'peptide', 'protein', label, 'RT_'+label]] 53 | else: 54 | df3 = df3[['origseq', 'peptide', 'protein', label]] 55 | 56 | 57 | if df_final is False: 58 | df_final = df3.reset_index(drop=True) 59 | else: 60 | df_final = df_final.reset_index(drop=True).merge(df3.reset_index(drop=True), on='peptide', how='outer') 61 | df_final.protein_x = df_final.protein_x.fillna(value=df_final.protein_y) 62 | df_final.origseq_x = df_final.origseq_x.fillna(value=df_final.origseq_y) 63 | df_final['protein'] = df_final['protein_x'] 64 | df_final['origseq'] = df_final['origseq_x'] 65 | 66 | df_final = df_final.drop(columns=['protein_x', 'protein_y']) 67 | df_final = df_final.drop(columns=['origseq_x', 'origseq_y']) 68 | 69 | return df_final 70 | 71 | 72 | def calc_sf_all(v, n, p): 73 | sf_values = -np.log10(binom.sf(v-1, n, p)) 74 | sf_values[v <= 2] = 0 75 | sf_values[np.isinf(sf_values)] = 20 76 | sf_values[n == 0] = 0 77 | return sf_values 78 | 79 | def noisygaus(x, a, x0, sigma, b): 80 | return a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2)) + b 81 | 82 | def calibrate_mass(bwidth, mass_left, mass_right, true_md): 83 | 84 | bbins = np.arange(-mass_left, mass_right, bwidth) 85 | H1, b1 = np.histogram(true_md, bins=bbins) 86 | b1 = b1 + bwidth 87 | b1 = b1[:-1] 88 | 89 | 90 | popt, pcov = curve_fit(noisygaus, b1, H1, p0=[1, np.median(true_md), 1, 1]) 91 | mass_shift, mass_sigma = popt[1], abs(popt[2]) 92 | return 
mass_shift, mass_sigma, pcov[0][0] 93 | 94 | def run(): 95 | parser = argparse.ArgumentParser( 96 | description='run DirectMS1quant for ms1searchpy results', 97 | epilog=''' 98 | 99 | Example usage 100 | ------------- 101 | $ directms1quant -S1 sample1_1_proteins_full.tsv sample1_n_proteins_full.tsv -S2 sample2_1_proteins_full.tsv sample2_n_proteins_full.tsv 102 | ------------- 103 | ''', 104 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 105 | 106 | parser.add_argument('-S1', nargs='+', help='input files for S1 sample', required=True) 107 | parser.add_argument('-S2', nargs='+', help='input files for S2 sample', required=True) 108 | parser.add_argument('-out', help='name of DirectMS1quant output file', default='directms1quant_out') 109 | parser.add_argument('-min_samples', help='minimum number of samples for peptide usage. 0 means 50%% of input files', default=0) 110 | parser.add_argument('-fold_change', help='FC threshold standard deviations', default=2.0, type=float) 111 | parser.add_argument('-fold_change_abs', help='Use absolute log2 scale FC threshold instead of standard deviations', action='store_true') 112 | # parser.add_argument('-bp', help='Experimental. Better percentage', default=80, type=int) 113 | parser.add_argument('-minl', help='Min peptide length for quantitation', default=7, type=int) 114 | parser.add_argument('-qval', help='qvalue threshold', default=0.05, type=float) 115 | parser.add_argument('-intensity_norm', help='Intensity normalization: 0-none, 1-median, 2-sum 1000 most intense peptides (default)', default=2, type=int) 116 | parser.add_argument('-all_proteins', help='use all proteins instead of FDR controlled', action='store_true') 117 | parser.add_argument('-all_pfms', help='use all PFMs instead of ML controlled', action='store_true') 118 | parser.add_argument('-allowed_peptides', help='path to allowed peptides') 119 | parser.add_argument('-allowed_proteins', help='path to allowed proteins') 120 | parser.add_argument('-protein_shifts', help='Experimental. path to protein shifts') 121 | parser.add_argument('-d', '-db', help='path to uniprot fasta file for gene annotation') 122 | parser.add_argument('-prefix', help='Decoy prefix. 
Default DECOY_', default='DECOY_', type=str) 123 | args = vars(parser.parse_args()) 124 | logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s', 125 | datefmt='[%H:%M:%S]', level=logging.INFO) 126 | 127 | process_files(args) 128 | 129 | 130 | def process_files(args): 131 | replace_label = '_proteins_full.tsv' 132 | decoy_prefix = args['prefix'] 133 | 134 | fold_change = float(args['fold_change']) 135 | 136 | all_s_lbls = {} 137 | 138 | logger.info('Starting analysis...') 139 | 140 | allowed_prots = set() 141 | allowed_prots_all = set() 142 | allowed_peptides = set() 143 | 144 | # cnt0 = Counter() 145 | 146 | cnt_file = 0 147 | 148 | for i in range(1, 3, 1): 149 | sample_num = 'S%d' % (i, ) 150 | if args[sample_num]: 151 | 152 | 153 | all_s_lbls[sample_num] = [] 154 | 155 | for z in args[sample_num]: 156 | 157 | cnt_file += 1 158 | logger.debug('Processing file %d', cnt_file) 159 | 160 | label = sample_num + '_' + z.replace(replace_label, '') 161 | all_s_lbls[sample_num].append(label) 162 | 163 | if not args['allowed_proteins']: 164 | if not args['all_proteins']: 165 | df0 = pd.read_table(z.replace('_proteins_full.tsv', '_proteins.tsv'), usecols=['dbname', ]) 166 | allowed_prots.update(df0['dbname']) 167 | allowed_prots.update([decoy_prefix + z for z in df0['dbname'].values]) 168 | else: 169 | df0 = pd.read_table(z, usecols=['dbname', ]) 170 | allowed_prots.update(df0['dbname']) 171 | 172 | if not args['allowed_peptides']: 173 | df0 = pd.read_table(z.replace('_proteins_full.tsv', '_PFMs_ML.tsv'), usecols=['seqs', 'qpreds', 'plen']) 174 | 175 | 176 | if not args['all_pfms']: 177 | df0 = df0[df0['qpreds'] <= 10] 178 | 179 | df0 = df0[df0['plen'] >= args['minl']] 180 | # df0['seqs'] = df0['seqs'] 181 | allowed_peptides.update(df0['seqs']) 182 | # cnt0.update(df0['seqs']) 183 | 184 | if args['allowed_proteins']: 185 | allowed_prots = set(z.strip() for z in open(args['allowed_proteins'], 'r').readlines()) 186 | allowed_prots.update([decoy_prefix + z for z in allowed_prots]) 187 | 188 | if args['allowed_peptides']: 189 | allowed_peptides = set(z.strip() for z in open(args['allowed_peptides'], 'r').readlines()) 190 | else: 191 | allowed_peptides = allowed_peptides 192 | 193 | logger.info('Total number of TARGET protein GROUPS: %d', len(allowed_prots) / 2) 194 | 195 | for i in range(1, 3, 1): 196 | sample_num = 'S%d' % (i, ) 197 | if args.get(sample_num, 0): 198 | for z in args[sample_num]: 199 | df3 = pd.read_table(z.replace(replace_label, '_PFMs.tsv'), usecols=['sequence', 'proteins', ]) 200 | df3 = df3[df3['sequence'].apply(lambda x: x in allowed_peptides)] 201 | 202 | df3_tmp = df3[df3['proteins'].apply(lambda x: any(z in allowed_prots for z in x.split(';')))] 203 | for dbnames in set(df3_tmp['proteins'].values): 204 | for dbname in dbnames.split(';'): 205 | allowed_prots_all.add(dbname) 206 | 207 | df_final = get_df_final(args, replace_label, allowed_peptides, allowed_prots_all, pep_RT=False, RT_threshold=False) 208 | 209 | logger.info('Total number of peptide sequences used in quantitation: %d', len(set(df_final['origseq']))) 210 | 211 | cols = [z for z in df_final.columns.tolist() if not z.startswith('mz_') and not z.startswith('RT_')] 212 | df_final = df_final[cols] 213 | 214 | df_final = df_final.set_index('peptide') 215 | 216 | 217 | all_lbls = all_s_lbls['S1'] + all_s_lbls['S2'] 218 | 219 | df_final_copy = df_final.copy() 220 | 221 | custom_min_samples = int(args['min_samples']) 222 | if custom_min_samples == 0: 223 | custom_min_samples = int(len(all_lbls)/2) 224 | 225 | 
df_final = df_final_copy.copy() 226 | 227 | max_missing = len(all_lbls) - custom_min_samples 228 | 229 | logger.info('Allowed max number of missing values: %d', max_missing) 230 | 231 | df_final['nummissing'] = df_final.isna().sum(axis=1) 232 | df_final['nummissing_S1'] = df_final[all_s_lbls['S1']].isna().sum(axis=1) 233 | df_final['nummissing_S2'] = df_final[all_s_lbls['S2']].isna().sum(axis=1) 234 | df_final['nonmissing_S1'] = len(all_s_lbls['S1']) - df_final['nummissing_S1'] 235 | df_final['nonmissing_S2'] = len(all_s_lbls['S2']) - df_final['nummissing_S2'] 236 | df_final['nonmissing'] = df_final['nummissing'] <= max_missing 237 | 238 | df_final = df_final[df_final['nonmissing']] 239 | logger.info('Total number of PFMs passed missing values threshold: %d', len(df_final)) 240 | 241 | 242 | df_final['S2_mean'] = df_final[all_s_lbls['S2']].mean(axis=1) 243 | df_final['S1_mean'] = df_final[all_s_lbls['S1']].mean(axis=1) 244 | df_final['FC_raw'] = np.log2(df_final['S2_mean']/df_final['S1_mean']) 245 | 246 | FC_max = df_final['FC_raw'].max() 247 | FC_min = df_final['FC_raw'].min() 248 | 249 | df_final.loc[(pd.isna(df_final['S2_mean'])) & (~pd.isna(df_final['S1_mean'])), 'FC_raw'] = FC_min 250 | df_final.loc[(~pd.isna(df_final['S2_mean'])) & (pd.isna(df_final['S1_mean'])), 'FC_raw'] = FC_max 251 | 252 | if args['intensity_norm'] == 2: 253 | for cc in all_lbls: 254 | df_final[cc] = df_final[cc] / df_final[cc].nlargest(1000).sum() 255 | 256 | elif args['intensity_norm'] == 1: 257 | for cc in all_lbls: 258 | df_final[cc] = df_final[cc] / df_final[cc].median() 259 | 260 | for slbl in ['1', '2']: 261 | S_len_current = len(all_s_lbls['S%s' % (slbl, )]) 262 | df_final['S%s_mean' % (slbl, )] = df_final[all_s_lbls['S%s' % (slbl, )]].mean(axis=1) 263 | df_final['S%s_std' % (slbl, )] = np.log2(df_final[all_s_lbls['S%s' % (slbl, )]]).std(axis=1) 264 | 265 | df_final['S1_std'] = df_final['S1_std'].fillna(df_final['S2_std']) 266 | df_final['S2_std'] = df_final['S2_std'].fillna(df_final['S1_std']) 267 | 268 | idx_to_calc_initial_pval = df_final[['nonmissing_S1', 'nonmissing_S2']].min(axis=1) >= 2 269 | df_final.loc[idx_to_calc_initial_pval, 'p-value'] = list(ttest_ind(np.log10(df_final.loc[idx_to_calc_initial_pval, all_s_lbls['S1']].values.astype(float)), np.log10(df_final.loc[idx_to_calc_initial_pval, all_s_lbls['S2']].values.astype(float)), axis=1, nan_policy='omit', equal_var=True)[1]) 270 | df_final['p-value'] = df_final['p-value'].astype(float) 271 | 272 | for cc in all_lbls: 273 | df_final[cc] = df_final[cc].fillna(df_final[cc].min()) 274 | 275 | idx_missing_pval = pd.isna(df_final['p-value']) 276 | 277 | df_final.loc[idx_missing_pval, 'p-value'] = list(ttest_ind(np.log10(df_final.loc[idx_missing_pval, all_s_lbls['S1']].values.astype(float)), np.log10(df_final.loc[idx_missing_pval, all_s_lbls['S2']].values.astype(float)), axis=1, nan_policy='omit', equal_var=True)[1]) 278 | 279 | df_final['p-value'] = df_final['p-value'].fillna(1.0) 280 | p_val_threshold = 0.1 281 | 282 | for cc in all_lbls: 283 | df_final[cc] = df_final[cc].fillna(df_final[cc].min()) 284 | 285 | df_final['intensity_median'] = df_final[['S1_mean', 'S2_mean']].max(axis=1) 286 | df_final['iq'] = pd.qcut(df_final['intensity_median'], 5, labels=range(5)).fillna(0).astype(int) 287 | 288 | df_final['FC'] = np.log2(df_final['S2_mean']/df_final['S1_mean']) 289 | 290 | 291 | FC_max = df_final['FC'].max() 292 | FC_min = df_final['FC'].min() 293 | 294 | df_final_for_calib = df_final.copy() 295 | df_final_for_calib = 
df_final_for_calib[~pd.isna(df_final_for_calib['S1_mean'])] 296 | df_final_for_calib = df_final_for_calib[df_final_for_calib['FC'] <= FC_max/2] 297 | df_final_for_calib = df_final_for_calib[df_final_for_calib['FC'] >= FC_min/2] 298 | df_final_for_calib = df_final_for_calib[~pd.isna(df_final_for_calib['S2_mean'])] 299 | 300 | df_final.loc[(pd.isna(df_final['S2_mean'])) & (~pd.isna(df_final['S1_mean'])), 'FC'] = FC_min 301 | df_final.loc[(~pd.isna(df_final['S2_mean'])) & (pd.isna(df_final['S1_mean'])), 'FC'] = FC_max 302 | 303 | tmp = df_final_for_calib['FC'] 304 | 305 | try: 306 | FC_mean, FC_std, covvalue_cor = calibrate_mass(0.05, -tmp.min(), tmp.max(), tmp) 307 | FC_mean2, FC_std2, covvalue_cor2 = calibrate_mass(0.1, -tmp.min(), tmp.max(), tmp) 308 | if not np.isinf(covvalue_cor2) and abs(FC_mean2) <= abs(FC_mean) / 10: 309 | FC_mean = FC_mean2 310 | FC_std = FC_std2 311 | except: 312 | FC_mean, FC_std, covvalue_cor = calibrate_mass(0.3, -tmp.min(), tmp.max(), tmp) 313 | 314 | if not args['fold_change_abs']: 315 | fold_change = FC_std * fold_change 316 | logger.info('Absolute FC threshold = %.2f +- %.2f', FC_mean, fold_change) 317 | 318 | df_final['decoy'] = df_final['protein'].apply(lambda x: all(z.startswith(decoy_prefix) for z in x.split(';'))) 319 | 320 | df_final = df_final.assign(protein=df_final['protein'].str.split(';')).explode('protein').reset_index(drop=False) 321 | df_final['proteins'] = df_final['protein'] 322 | df_final = df_final.drop(columns=['protein']) 323 | 324 | df_final = df_final.sort_values(by=['nummissing', 'intensity_median'], ascending=(True, False)) 325 | df_final = df_final.drop_duplicates(subset=('origseq', 'proteins')) 326 | 327 | 328 | df_final['FC_corrected'] = df_final['FC'] - FC_mean 329 | 330 | if args['protein_shifts']: 331 | df_shifts = pd.read_table(args['protein_shifts']) 332 | shifts_map = df_shifts.set_index('dbname')['FC shift'].to_dict() 333 | df_final['FC_corrected'] = df_final.apply(lambda x: x['FC_corrected'] - shifts_map.get(x['proteins'], 0), axis=1) 334 | 335 | 336 | 337 | 338 | df_final['FC_abs'] = df_final['FC_corrected'].abs() 339 | df_final = df_final.sort_values(by='FC_abs').reset_index(drop=True) 340 | df_final['FC_abs'] = df_final['FC_corrected'] 341 | 342 | idx_stat = df_final['p-value'] <= p_val_threshold 343 | df_final['FC_gr_mean'] = df_final.groupby('proteins', group_keys=False)['FC_abs'].apply(lambda x: x.expanding().mean()) 344 | 345 | neg_idx = (df_final['FC_corrected'] < 0) 346 | pos_idx = (df_final['FC_corrected'] >= 0) 347 | 348 | pos_idx_real = df_final[pos_idx].index 349 | neg_idx_real = df_final[neg_idx].index 350 | 351 | pos_idx_real_set = set(pos_idx_real) 352 | neg_idx_real_set = set(neg_idx_real) 353 | 354 | df_final_decoy = df_final[df_final['decoy']] 355 | 356 | FC_pools = dict() 357 | FC_pools['common'] = dict() 358 | 359 | df1_decoy_grouped_common = df_final[df_final['decoy']].groupby('iq') 360 | for group_name, df_group in df1_decoy_grouped_common: 361 | FC_pools['common'][group_name] = list(df_group[['FC_abs', 'p-value']].values) 362 | 363 | df_final['sign'] = False 364 | 365 | df1_grouped = df_final.groupby('proteins') 366 | 367 | 368 | for group_name, df_group in df1_grouped: 369 | 370 | prot_idx = df_group.index 371 | 372 | idx = sorted(list(prot_idx)) 373 | idx_len = len(idx) 374 | 375 | loc_pos_pvalues = df_final.loc[idx, 'p-value'].values 376 | 377 | loc_pos_values = df_final.loc[idx, 'FC_gr_mean'].values 378 | loc_pos_values = np.abs(loc_pos_values) 379 | 380 | pos_missing_list = list(df_final.loc[idx, 
'iq'].values) 381 | better_res = np.array([0] * idx_len) 382 | for _ in range(100): 383 | random_list_tmp = [random.choice(FC_pools['common'][nm]) for nm in pos_missing_list] 384 | random_list = [z[0] for z in random_list_tmp] 385 | random_list_pvalues = np.array([z[1] for z in random_list_tmp]) 386 | 387 | list_to_compare_current = np.cumsum(random_list) / np.arange(1, idx_len+1, 1) 388 | list_to_compare_current = np.abs(list_to_compare_current) 389 | 390 | better_res += (loc_pos_values >= list_to_compare_current) * (loc_pos_pvalues <= random_list_pvalues) 391 | df_final.loc[idx, 'bp'] = better_res 392 | 393 | # Equivalent of 5% probability to be Random 394 | df_final.loc[idx, 'sign'] = df_final.loc[idx, 'bp'] >= 58 395 | 396 | df_final['up'] = df_final['sign'] * (df_final['FC_corrected'] > 0) 397 | df_final['down'] = df_final['sign'] * (df_final['FC_corrected'] < 0) 398 | 399 | cols = [z for z in df_final.columns.tolist() if not z.startswith('mz_') and not z.startswith('RT_')] 400 | cols.remove('proteins') 401 | cols.insert(0, 'proteins') 402 | df_final = df_final[cols] 403 | 404 | df_final.to_csv(path_or_buf=args['out']+'_quant_peptides.tsv', sep='\t', index=False) 405 | 406 | df_final = df_final.sort_values(by=['nummissing', 'intensity_median'], ascending=(True, False)) 407 | df_final = df_final.drop_duplicates(subset=('origseq', 'proteins')) 408 | 409 | prot_to_peps = defaultdict(str) 410 | for dbname, pepseq in df_final.sort_values(by='origseq')[['proteins', 'origseq']].values: 411 | prot_to_peps[dbname] += pepseq 412 | 413 | all_peps_cnt = Counter(list(prot_to_peps.values())) 414 | peps_more_than_2 = set([k for k, v in all_peps_cnt.items() if v >= 2]) 415 | 416 | pep_groups = {} 417 | protein_groups = {} 418 | cur_group = 1 419 | for dbname, pepseq in prot_to_peps.items(): 420 | if pepseq not in peps_more_than_2: 421 | protein_groups[dbname] = cur_group 422 | cur_group += 1 423 | else: 424 | if pepseq not in pep_groups: 425 | pep_groups[pepseq] = cur_group 426 | protein_groups[dbname] = cur_group 427 | cur_group += 1 428 | else: 429 | protein_groups[dbname] = pep_groups[pepseq] 430 | 431 | del pep_groups 432 | del prot_to_peps 433 | del peps_more_than_2 434 | del all_peps_cnt 435 | 436 | up_dict = df_final.groupby('proteins')['up'].sum().to_dict() 437 | down_dict = df_final.groupby('proteins')['down'].sum().to_dict() 438 | 439 | ####### !!!!!!! 
####### 440 | df_final['up'] = df_final.apply(lambda x: x['up'] if up_dict.get(x['proteins'], 0) >= down_dict.get(x['proteins'], 0) else x['down'], axis=1) 441 | protsN = df_final.groupby('proteins')['up'].count().to_dict() 442 | 443 | prots_up = df_final.groupby('proteins')['up'].sum() 444 | decoy_df = df_final[df_final['decoy']].drop_duplicates(subset='origseq') 445 | 446 | N_decoy_total = len(decoy_df) 447 | upreg_decoy_total = decoy_df['up'].sum() 448 | 449 | N_nondecoy_total = (~df_final['decoy']).sum() 450 | 451 | p_up = upreg_decoy_total / N_decoy_total 452 | 453 | names_arr = np.array(list(protsN.keys())) 454 | 455 | logger.info('Total number of proteins used in quantitation: %d', sum(not z.startswith(decoy_prefix) for z in names_arr)) 456 | logger.info('Total number of peptides: %d', len(df_final)) 457 | logger.info('Total number of decoy peptides: %d', N_decoy_total) 458 | logger.info('Probability of random peptide to be differentially expressed: %.3f', p_up) 459 | 460 | v_arr = np.array(list(prots_up.get(k, 0) for k in names_arr)) 461 | n_arr = np.array(list(protsN.get(k, 0) for k in names_arr)) 462 | 463 | all_pvals = calc_sf_all(v_arr, n_arr, p_up) 464 | 465 | total_set = set() 466 | total_set_genes = set() 467 | 468 | FC_up_dict_basic = df_final.groupby('proteins')['FC_corrected'].median().to_dict() 469 | FC_up_dict_raw_basic = df_final.groupby('proteins')['FC_raw'].median().to_dict() 470 | 471 | df_final_up_idx = (df_final['up']>0) 472 | 473 | df_final.loc[df_final_up_idx, 'bestmissing'] = df_final.loc[df_final_up_idx, :].groupby('proteins')['nummissing'].transform('min') 474 | 475 | FC_up_dict2 = df_final.loc[df_final_up_idx, :].groupby('proteins')['FC_corrected'].median().to_dict() 476 | FC_up_dict_raw2 = df_final.loc[df_final_up_idx, :].groupby('proteins')['FC_raw'].median().to_dict() 477 | 478 | df_out = pd.DataFrame() 479 | df_out['score'] = all_pvals 480 | df_out['dbname'] = names_arr 481 | 482 | df_out['log2FoldChange(S2/S1)'] = df_out['dbname'].apply(lambda x: FC_up_dict2.get(x)) 483 | df_out['log2FoldChange(S2/S1) no normalization'] = df_out['dbname'].apply(lambda x: FC_up_dict_raw2.get(x)) 484 | 485 | df_out.loc[pd.isna(df_out['log2FoldChange(S2/S1)']), 'log2FoldChange(S2/S1)'] = df_out.loc[pd.isna(df_out['log2FoldChange(S2/S1)']), 'dbname'].apply(lambda x: FC_up_dict_basic.get(x)) 486 | df_out.loc[pd.isna(df_out['log2FoldChange(S2/S1) no normalization']), 'log2FoldChange(S2/S1) no normalization'] = df_out.loc[pd.isna(df_out['log2FoldChange(S2/S1) no normalization']), 'dbname'].apply(lambda x: FC_up_dict_raw_basic.get(x)) 487 | 488 | 489 | df_out.loc[:, 'log2FoldChange(S2/S1) using all peptides'] = df_out.loc[:, 'dbname'].apply(lambda x: FC_up_dict_basic.get(x)) 490 | df_out.loc[:, 'log2FoldChange(S2/S1) using all peptides and no normalization'] = df_out.loc[:, 'dbname'].apply(lambda x: FC_up_dict_raw_basic.get(x)) 491 | 492 | df_out['differentially expressed peptides'] = v_arr 493 | df_out['identified peptides'] = n_arr 494 | 495 | df_out['decoy'] = df_out['dbname'].str.startswith(decoy_prefix) 496 | 497 | df_out = df_out[~df_out['decoy']] 498 | 499 | df_out['protname'] = df_out['dbname'].apply(lambda x: x.split('|')[1] if '|' in x else x) 500 | df_out['protein_quant_group'] = df_out['dbname'].apply(lambda x: protein_groups[x]) 501 | 502 | genes_map = {} 503 | if args['d']: 504 | for prot, protseq in fasta.read(args['d']): 505 | if decoy_prefix not in prot: 506 | try: 507 | prot_name = prot.split('|')[1] 508 | except: 509 | prot_name = prot.split(' ')[0] 510 | 
try: 511 | gene_name = prot.split('GN=')[1].split(' ')[0] 512 | except: 513 | gene_name = prot.split(' ')[0] 514 | genes_map[prot_name] = gene_name 515 | 516 | df_out['gene'] = df_out['protname'].apply(lambda x: genes_map[x]) 517 | 518 | else: 519 | df_out['gene'] = df_out['protname'] 520 | 521 | 522 | qval_threshold = args['qval'] 523 | 524 | min_matched_peptides = 3 525 | 526 | df_out = df_out.sort_values(by='score', ascending=False).reset_index(drop=True) 527 | 528 | df_out['FC_pass'] = False 529 | df_out['FC_pass'] = df_out['log2FoldChange(S2/S1)'].abs() >= fold_change 530 | 531 | 532 | BH_idx = (df_out['identified peptides'] >= min_matched_peptides) & (df_out['FC_pass']) 533 | 534 | BH_idx_pos = (df_out['log2FoldChange(S2/S1)'] >= 0) & (df_out['identified peptides'] >= min_matched_peptides) 535 | BH_idx_neg = (df_out['log2FoldChange(S2/S1)'] < 0) & (df_out['identified peptides'] >= min_matched_peptides) 536 | 537 | # df_out['p-value'] = 1.0 538 | df_out['p-value'] = 10**(-df_out['score']) 539 | df_out['BH_pass'] = False 540 | 541 | df_out_BH_multiplier = len(set(df_out[BH_idx]['protein_quant_group'])) 542 | lbl_to_use = 'protein_quant_group' 543 | 544 | current_rank = 0 545 | BH_threshold_array = [] 546 | added_groups = set() 547 | for z in df_out[BH_idx][lbl_to_use].values: 548 | if z not in added_groups: 549 | added_groups.add(z) 550 | current_rank += 1 551 | BH_threshold_array.append(-np.log10(current_rank * qval_threshold / df_out_BH_multiplier)) 552 | df_out.loc[BH_idx, 'BH_threshold'] = BH_threshold_array 553 | 554 | df_out.loc[BH_idx, 'BH_pass'] = df_out.loc[BH_idx, 'score'] >= df_out.loc[BH_idx, 'BH_threshold'] 555 | df_out.loc[BH_idx, 'FDR_pass'] = df_out.loc[BH_idx, 'score'] >= -np.log10(args['qval']) 556 | # df_out.loc[BH_idx, 'BH_pass'] = df_out.loc[BH_idx, 'score'] >= -np.log10(args['qval']) 557 | 558 | # score_threshold = df_out.loc[(df_out['BH_pass']) & (BH_idx)]['score'].min() 559 | # df_out.loc[BH_idx, 'BH_pass'] = df_out.loc[BH_idx, 'score'] >= score_threshold 560 | 561 | df_out = df_out.drop(columns = {'decoy'}) 562 | 563 | df_out = df_out[['score', 'p-value', 'dbname', 'log2FoldChange(S2/S1)', 'differentially expressed peptides', 564 | 'identified peptides', 'log2FoldChange(S2/S1) no normalization', 'log2FoldChange(S2/S1) using all peptides', 565 | 'log2FoldChange(S2/S1) using all peptides and no normalization', 'protname', 'protein_quant_group', 'gene', 'FC_pass', 'FDR_pass', 'BH_pass',]]# 'BH_threshold']] 566 | 567 | df_out.to_csv(path_or_buf=args['out']+'_quant_full.tsv', sep='\t', index=False) 568 | 569 | # df_out_f = df_out[(df_out['FDR_pass']) & (df_out['FC_pass'])] 570 | df_out_f = df_out[(df_out['BH_pass']) & (df_out['FC_pass'])] 571 | 572 | df_out_f.to_csv(path_or_buf=args['out']+'.tsv', sep='\t', index=False) 573 | 574 | for z in set(df_out_f['dbname']): 575 | try: 576 | prot_name = z.split('|')[1] 577 | except: 578 | prot_name = z 579 | 580 | gene_name = genes_map.get(prot_name, prot_name) 581 | 582 | total_set.add(prot_name) 583 | total_set_genes.add(gene_name) 584 | 585 | logger.info('Total number of significantly changed proteins: %d', len(total_set)) 586 | logger.info('Total number of significantly changed genes: %d', len(total_set_genes)) 587 | 588 | f1 = open(args['out'] + '_proteins_for_stringdb.txt', 'w') 589 | for z in total_set: 590 | f1.write(z + '\n') 591 | f1.close() 592 | 593 | f1 = open(args['out'] + '_genes_for_stringdb.txt', 'w') 594 | for z in total_set_genes: 595 | f1.write(z + '\n') 596 | f1.close() 597 | 598 | if __name__ == 
'__main__': 599 | run() 600 | -------------------------------------------------------------------------------- /ms1searchpy/directms1quantmulti.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import argparse 3 | import pandas as pd 4 | import numpy as np 5 | from scipy.stats import binom, ttest_ind 6 | from scipy.optimize import curve_fit 7 | import logging 8 | from pyteomics import fasta 9 | from collections import Counter, defaultdict 10 | import random 11 | random.seed(42) 12 | 13 | from . import directms1quant 14 | from os import path, listdir, makedirs 15 | from copy import copy 16 | import matplotlib.pyplot as plt 17 | import seaborn as sb 18 | 19 | logger = logging.getLogger(__name__) 20 | 21 | def run(): 22 | parser = argparse.ArgumentParser( 23 | description='Prepare LFQ protein table for project with multiple conditions (time-series,\ 24 | concentration, thermal profiling, etc.', 25 | epilog=''' 26 | 27 | Example usage 28 | ------------- 29 | $ directms1quantmulti -S1 sample1_1_proteins_full.tsv sample1_n_proteins_full.tsv -S2 sample2_1_proteins_full.tsv sample2_n_proteins_full.tsv 30 | ------------- 31 | ''', 32 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 33 | 34 | parser.add_argument('-pdir', help='path to project folder with search results', required=True) 35 | parser.add_argument('-d', '-db', help='path to uniprot fasta file used for ms1searchpy', required=True) 36 | parser.add_argument('-samples', help='tsv table with sample details', required=True) 37 | parser.add_argument('-out', help='name of DirectMS1quant output files', default='DQmulti') 38 | parser.add_argument('-norm', help='LFQ normalization: (1) using median of CONTROL group, default; (2) using median of all groups', default=1, type=int) 39 | parser.add_argument('-proteins_for_figure', help='path to proteins for figure plotting', default='', type=str) 40 | parser.add_argument('-figdir', help='path to output folder for figures', default='') 41 | parser.add_argument('-min_samples', help='minimum number of non-missing peptide intensities for DirectMS1Quant pairwise comparisons. 0 means 50%% of input files', default=1) 42 | parser.add_argument('-pep_min_non_missing_samples', help='minimum fraction of files with non missing values for peptide', default=0.5, type=float) 43 | # parser.add_argument('-min_signif_for_pept', help='minimum number of pairwise DE results where peptide should be significant', default=999, type=int) 44 | parser.add_argument('-prefix', help='Decoy prefix. 
Default DECOY_', default='DECOY_', type=str) 45 | parser.add_argument('-start_stage', help='Can be 1, 2 or 3 to skip any stage which were already done', default=1, type=int) 46 | args = vars(parser.parse_args()) 47 | logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s', 48 | datefmt='[%H:%M:%S]', level=logging.INFO) 49 | 50 | process_files(args) 51 | 52 | 53 | def process_files(args): 54 | 55 | infolder = args['pdir'] 56 | ms1folder = args['pdir'] 57 | path_to_fasta = args['d'] 58 | 59 | df1 = pd.read_table(args['samples'], dtype={'group': str, 'condition': str, 'vs': str}) 60 | df1['sample'] = df1['group'] 61 | if 'condition' not in df1.columns: 62 | df1['condition'] = '' 63 | 64 | if 'replicate' not in df1.columns: 65 | df1['replicate'] = 1 66 | 67 | if 'BatchMS' not in df1.columns: 68 | df1['BatchMS'] = 1 69 | 70 | if 'vs' not in df1.columns: 71 | df1['vs'] = df1['condition'] 72 | df1['vs'] = df1['vs'].fillna(df1['condition']) 73 | 74 | df1_filenames = set(df1['File Name']) 75 | 76 | f_dict_map = {} 77 | 78 | for fname, sname, ttime in df1[['File Name', 'sample', 'condition']].values: 79 | f_dict_map[fname] = (sname, ttime) 80 | 81 | 82 | s_files_dict = defaultdict(list) 83 | 84 | for fn in listdir(ms1folder): 85 | if fn.endswith('_proteins_full.tsv'): 86 | f_name = fn.replace('.features_proteins_full.tsv', '') 87 | if f_name in f_dict_map: 88 | s_files_dict[f_dict_map[f_name]].append(path.join(ms1folder, fn)) 89 | 90 | control_label = df1['group'].values[0] 91 | 92 | 93 | all_conditions = df1[df1['group']!=control_label].set_index(['group', 'condition'])['vs'].to_dict() 94 | 95 | outlabel = args['out'] 96 | 97 | 98 | dquant_params_base = { 99 | 'min_samples': args['min_samples'], 100 | 'fold_change': 2.0, 101 | 'bp': 80, 102 | 'minl': 7, 103 | 'qval': 0.05, 104 | 'intensity_norm': 2, 105 | 'allowed_peptides': '', 106 | 'protein_shifts': '', 107 | 'allowed_proteins': '', 108 | 'all_proteins': '', 109 | 'all_pfms': '', 110 | 'fold_change_abs': '', 111 | 'prefix': args['prefix'], 112 | 'd': args['d'], 113 | 'fold_change_no_correction': '', 114 | } 115 | 116 | 117 | if args['start_stage'] <= 1: 118 | 119 | logger.info('Starting Stage 1: Run pairwise DirectMS1Quant runs...') 120 | 121 | for i2, i1_val in all_conditions.items(): 122 | out_name = path.join(ms1folder, '%s_directms1quant_out_%s_vs_%s%s' % (outlabel, ''.join(list(i2)), control_label, i1_val)) 123 | dquant_params = copy(dquant_params_base) 124 | dquant_params['S1'] = s_files_dict[(control_label, i1_val)] 125 | dquant_params['S2'] = s_files_dict[i2] 126 | 127 | dquant_params['out'] = out_name 128 | 129 | directms1quant.process_files(dquant_params) 130 | 131 | else: 132 | logger.info('Skipping Stage 1: Run pairwise DirectMS1Quant runs...') 133 | 134 | 135 | pep_cnt = Counter() 136 | pep_cnt_up = Counter() 137 | for i2, i1_val in all_conditions.items(): 138 | out_name = path.join(ms1folder, '%s_directms1quant_out_%s_vs_%s%s.tsv' % (outlabel, ''.join(list(i2)), control_label, i1_val)) 139 | # if os.path.isfile(out_name): 140 | df0_full = pd.read_table(out_name.replace('.tsv', '_quant_peptides.tsv'), usecols=['origseq', 'up', 'down', 'proteins']) 141 | 142 | 143 | 144 | 145 | up_dict = df0_full.groupby('proteins')['up'].sum().to_dict() 146 | down_dict = df0_full.groupby('proteins')['down'].sum().to_dict() 147 | 148 | ####### !!!!!!! 
####### 149 | df0_full['up'] = df0_full.apply(lambda x: x['up'] if up_dict.get(x['proteins'], 0) >= down_dict.get(x['proteins'], 0) else x['down'], axis=1) 150 | 151 | 152 | df0_full = df0_full.sort_values(by='up', ascending=False) 153 | df0_full = df0_full.drop_duplicates(subset=['origseq', 'proteins']) 154 | for pep, up_v, prot_for_pep in df0_full[['origseq', 'up', 'proteins']].values: 155 | if up_v: 156 | pep_cnt_up[(pep, prot_for_pep)] += 1 157 | pep_cnt[(pep, prot_for_pep)] += 1 158 | 159 | 160 | allowed_peptides_base = set(k for k, v in pep_cnt.items()) 161 | logger.info('Total number of quantified peptides: %d', len(allowed_peptides_base)) 162 | allowed_peptides_base_only_sequences = set(k[0] for k in allowed_peptides_base) 163 | 164 | # allowed_peptides_up = set(k for k, v in pep_cnt_up.items() if v >= args['min_signif_for_pept']) 165 | # logger.info('Total number of significant quantified peptides: %d', len(allowed_peptides_up)) 166 | 167 | replace_label = '_proteins_full.tsv' 168 | decoy_prefix = args['prefix'] 169 | 170 | names_map = {} 171 | 172 | args_local = {'S1': []} 173 | args_local['allowed_peptides'] = '' 174 | args_local['allowed_proteins'] = '' 175 | for z in listdir(infolder): 176 | if z.endswith(replace_label): 177 | zname = z.split('.')[0] 178 | if zname in df1_filenames: 179 | args_local['S1'].append(path.join(infolder, z)) 180 | names_map[args_local['S1'][-1]] = zname 181 | 182 | all_s_lbls = {} 183 | 184 | allowed_prots = set() 185 | allowed_prots_all = set() 186 | allowed_peptides = set() 187 | 188 | cnt0 = Counter() 189 | 190 | if args['start_stage'] > 2: 191 | 192 | sample_num = 'S1' 193 | 194 | all_s_lbls[sample_num] = [] 195 | 196 | for z in args_local[sample_num]: 197 | label = sample_num + '_' + z.replace(replace_label, '') 198 | all_s_lbls[sample_num].append(label) 199 | 200 | all_lbls = all_s_lbls['S1'] 201 | out_name = path.join(ms1folder, '%s_peptide_LFQ.tsv' % (outlabel, )) 202 | df_final = pd.read_table(out_name) 203 | 204 | logger.info('Skipping Stage 2: Prepare full LFQ peptide table...') 205 | 206 | else: 207 | logger.info('Starting Stage 2: Prepare full LFQ peptide table...') 208 | 209 | sample_num = 'S1' 210 | 211 | all_s_lbls[sample_num] = [] 212 | 213 | for z in args_local[sample_num]: 214 | label = sample_num + '_' + z.replace(replace_label, '') 215 | all_s_lbls[sample_num].append(label) 216 | 217 | df0 = pd.read_table(z.replace('_proteins_full.tsv', '_proteins.tsv'), usecols=['dbname', ]) 218 | allowed_prots.update(df0['dbname']) 219 | allowed_prots.update([decoy_prefix + z for z in df0['dbname'].values]) 220 | 221 | df0 = pd.read_table(z.replace('_proteins_full.tsv', '_PFMs_ML.tsv'), usecols=['seqs', 'qpreds'])#, 'ch', 'im']) 222 | df0 = df0[df0['qpreds'] <= 10] 223 | df0['seqs'] = df0['seqs']# + df0['ch'].astype(str) + df0['im'].astype(str) 224 | allowed_peptides.update(df0['seqs']) 225 | cnt0.update(df0['seqs']) 226 | 227 | custom_min_samples = 1 228 | 229 | allowed_peptides = set() 230 | for k, v in cnt0.items(): 231 | if v >= custom_min_samples: 232 | allowed_peptides.add(k) 233 | 234 | 235 | logger.info('Total number of TARGET protein GROUPS: %d', len(allowed_prots) / 2) 236 | 237 | sample_num = 'S1' 238 | 239 | if args_local.get(sample_num, 0): 240 | for z in args_local[sample_num]: 241 | df3 = pd.read_table(z.replace(replace_label, '_PFMs.tsv'), usecols=['sequence', 'proteins', 'charge', 'ion_mobility']) 242 | 243 | df3['tmpseq'] = df3['sequence']# + df3['charge'].astype(str) + df3['ion_mobility'].astype(str) 244 | df3 = 
df3[df3['tmpseq'].apply(lambda x: x in allowed_peptides)] 245 | 246 | df3_tmp = df3[df3['proteins'].apply(lambda x: any(z in allowed_prots for z in x.split(';')))] 247 | for dbnames in set(df3_tmp['proteins'].values): 248 | for dbname in dbnames.split(';'): 249 | allowed_prots_all.add(dbname) 250 | 251 | df_final = directms1quant.get_df_final(args_local, replace_label, allowed_peptides, allowed_prots_all, pep_RT=False, RT_threshold=False) 252 | 253 | logger.info('Total number of peptide sequences used in quantitation: %d', len(set(df_final['origseq']))) 254 | 255 | cols = [z for z in df_final.columns.tolist() if not z.startswith('mz_') and not z.startswith('RT_')] 256 | df_final = df_final[cols] 257 | 258 | df_final = df_final.set_index('peptide') 259 | 260 | all_lbls = all_s_lbls['S1'] 261 | 262 | df_final_copy = df_final.copy() 263 | 264 | df_final = df_final_copy.copy() 265 | 266 | max_missing = len(all_lbls) - custom_min_samples 267 | 268 | df_final['nummissing'] = df_final.isna().sum(axis=1) 269 | df_final['nonmissing'] = df_final['nummissing'] <= max_missing 270 | 271 | df_final = df_final[df_final['nonmissing']] 272 | logger.info('Total number of PFMs: %d', len(df_final)) 273 | 274 | out_name = path.join(ms1folder, '%s_peptide_LFQ.tsv' % (outlabel, )) 275 | df_final.to_csv(out_name, sep='\t', index=False) 276 | 277 | 278 | 279 | 280 | if args['start_stage'] <= 3: 281 | 282 | logger.info('Starting Stage 3: Prepare full LFQ protein table...') 283 | 284 | 285 | all_lbls_by_batch = defaultdict(list) 286 | all_filenames_control = set(df1[df1['group'] == control_label]['File Name']) 287 | all_lbls_control = set() 288 | 289 | bdict = df1.set_index('File Name')['BatchMS'].to_dict() 290 | 291 | for cc in all_lbls: 292 | try: 293 | origfn = names_map[path.join(infolder, cc.split('/')[-1] + replace_label)] 294 | except: 295 | origfn = cc.split('_', 1)[-1] + replace_label 296 | all_lbls_by_batch[bdict[origfn]].append(cc) 297 | if origfn in all_filenames_control: 298 | all_lbls_control.add(cc) 299 | 300 | 301 | for lbl_name, small_lbls in all_lbls_by_batch.items(): 302 | lbl_len = len(small_lbls) 303 | idx_to_keep = df_final[small_lbls].isna().sum(axis=1) <= (1 - args['pep_min_non_missing_samples']) * lbl_len 304 | df_final = df_final[idx_to_keep] 305 | 306 | for cc in all_lbls: 307 | df_final[cc] = df_final[cc] / df_final[cc].nlargest(1000).sum() 308 | 309 | 310 | 311 | idx_to_keep = df_final['origseq'].apply(lambda x: x in allowed_peptides_base_only_sequences) 312 | df_final = df_final[idx_to_keep] 313 | 314 | 315 | for small_lbls in all_lbls_by_batch.values(): 316 | m = df_final[small_lbls].min(axis=1) 317 | for col in df_final[small_lbls]: 318 | df_final.loc[:, col] = df_final.loc[:, col].fillna(m) 319 | 320 | for lbl_key, small_lbls in all_lbls_by_batch.items(): 321 | 322 | if args['norm'] == 2: 323 | m = df_final[small_lbls].median(axis=1) 324 | elif args['norm'] == 1: 325 | small_lbls_control = [z for z in small_lbls if z in all_lbls_control] 326 | m = df_final[small_lbls_control].median(axis=1) 327 | else: 328 | logger.error('norm option must be 1 or 2!') 329 | return -1 330 | 331 | 332 | for col in df_final[small_lbls]: 333 | df_final.loc[:, col] = np.log2(df_final.loc[:, col] / m) 334 | 335 | df_final = df_final.fillna(-10) 336 | 337 | 338 | df_final = df_final.assign(protein=df_final['protein'].str.split(';')).explode('protein').reset_index(drop=True) 339 | df_final['proteins'] = df_final['protein'] 340 | df_final = df_final.drop(columns=['protein']) 341 | cols = [z for z in 
df_final.columns.tolist() if not z.startswith('mz_') and not z.startswith('RT_')]
342 |         cols.remove('proteins')
343 |         cols.insert(0, 'proteins')
344 |         df_final = df_final[cols]
345 | 
346 |         idx_to_keep = df_final.apply(lambda x: (x['origseq'], x['proteins']) in allowed_peptides_base, axis=1)
347 |         df_final = df_final[idx_to_keep].copy()
348 | 
349 | 
350 | 
351 |         # idx_to_keep = df_final['origseq'].apply(lambda x: x in allowed_peptides_up)
352 |         # idx_to_keep = df_final.apply(lambda x: (x['origseq'], x['proteins']) in allowed_peptides_up, axis=1)
353 |         # df_final_accurate = df_final[idx_to_keep].copy()
354 |         # df_final_accurate = df_final_accurate[df_final_accurate.groupby('proteins')['origseq'].transform('count') > 1]
355 |         # accurate_proteins = set(df_final_accurate['proteins'])
356 |         # df_final = pd.concat([df_final[df_final['proteins'].apply(lambda x: x not in accurate_proteins)], df_final_accurate])
357 | 
358 | 
359 |         df_final['S1_mean'] = df_final[all_lbls].median(axis=1)
360 |         df_final['intensity_median'] = df_final['S1_mean']
361 | 
362 |         df_final = df_final.sort_values(by=['nummissing', 'intensity_median'], ascending=(True, False))
363 |         df_final = df_final.drop_duplicates(subset=('origseq', 'proteins'))
364 | 
365 |         df_final = df_final[df_final['proteins'].apply(lambda x: not x.startswith(decoy_prefix))]
366 | 
367 |         df_final = df_final[df_final.groupby('proteins')['S1_mean'].transform('count') > 1]
368 | 
369 |         origfnames = []
370 |         for cc in all_lbls:
371 |             origfn = cc.split('/')[-1].split('.')[0]
372 |             origfnames.append(origfn)
373 | 
374 |         dft = pd.DataFrame.from_dict([df_final.groupby('proteins')[cc].median().to_dict() for cc in all_lbls])
375 |         dft2 = pd.DataFrame.from_dict([df_final.groupby('origseq')[cc].median().to_dict() for cc in all_lbls])
376 | 
377 |         dft['File Name'] = origfnames
378 |         dft2['File Name'] = origfnames
379 | 
380 |         df1x = pd.merge(df1, dft2, on='File Name', how='left')
381 |         df1 = pd.merge(df1, dft, on='File Name', how='left')
382 | 
383 |         df1['sample+condition'] = df1['sample'].apply(lambda x: x + ' ') + df1['condition']
384 |         df1x['sample+condition'] = df1x['sample'].apply(lambda x: x + ' ') + df1x['condition']
385 | 
386 | 
387 |         out_name = path.join(ms1folder, '%s_proteins_LFQ.tsv' % (outlabel, ))
388 |         out_namex = path.join(ms1folder, '%s_peptide_LFQ_processed.tsv' % (outlabel, ))
389 |         df1.to_csv(out_name, sep='\t', index=False)
390 |         df1x.to_csv(out_namex, sep='\t', index=False)
391 | 
392 |     else:
393 | 
394 |         logger.info('Skipping Stage 3: Prepare full LFQ protein table...')
395 |         out_name = path.join(ms1folder, '%s_proteins_LFQ.tsv' % (outlabel, ))
396 |         df1 = pd.read_table(out_name)
397 | 
398 |     if args['start_stage'] <= 4:
399 | 
400 |         logger.info('Starting Stage 4: Plot figures for selected proteins...')
401 | 
402 |         warning_msg_1 = 'Provide a file with proteins selected for figures.\
403 | It should be a tsv table with a dbname column. For example, it could be the standard output table of directms1quant.\
404 | Note that protein names should be provided in uniprot format. 
For example, sp|P28838|AMPL_HUMAN' 405 | 406 | if not args['proteins_for_figure']: 407 | logger.warning(warning_msg_1) 408 | else: 409 | 410 | df_prots = pd.read_table(args['proteins_for_figure']) 411 | if 'dbname' not in df_prots.columns: 412 | logger.warning('dbname column is missing in proteins file') 413 | else: 414 | 415 | allowed_proteins_for_figures = set(df_prots['dbname']) 416 | 417 | for k in allowed_proteins_for_figures: 418 | if k not in df1.columns: 419 | logger.info('Protein %s was not quantified', k) 420 | else: 421 | 422 | 423 | try: 424 | gname = k.split('|')[1] 425 | except: 426 | gname = k 427 | 428 | 429 | plt.figure() 430 | prot_name = k 431 | 432 | all_one_char_colors = ['m', 'r', 'c', 'g', 'b', 'y', ] 433 | 434 | color_custom = {} 435 | for s_idx, s_val in enumerate(set(df1['sample'])): 436 | s_idx_sm = s_idx 437 | while s_idx_sm >= 6: 438 | s_idx_sm -= 6 439 | color_custom[s_val] = all_one_char_colors[s_idx_sm] 440 | 441 | my_pal = dict() 442 | for x_val, s_val in df1[['sample+condition', 'sample']].values: 443 | my_pal[x_val] = color_custom[s_val] 444 | 445 | ax = sb.boxplot(data=df1, x='sample+condition', hue = 'sample+condition', y = prot_name, palette=my_pal, legend=False) 446 | xlabels_custom = [z for z in ax.get_xticklabels()] 447 | ax.set_xticks(ax.get_xticks()) 448 | ax.set_xticklabels(xlabels_custom, rotation=90, size=12) 449 | plt.title(ax.get_ylabel()) 450 | ax.set_ylabel('log2 LFQ', size=14) 451 | plt.subplots_adjust() 452 | plt.tight_layout() 453 | if args['figdir']: 454 | out_figdir = args['figdir'] 455 | if not path.isdir(out_figdir): 456 | makedirs(out_figdir) 457 | else: 458 | out_figdir = infolder 459 | plt.savefig(path.join(out_figdir, '%s_%s.png' % (args['out'], gname, ))) 460 | 461 | 462 | if __name__ == '__main__': 463 | run() 464 | -------------------------------------------------------------------------------- /ms1searchpy/group_specific.py: -------------------------------------------------------------------------------- 1 | from .main import final_iteration, filter_results 2 | from .utils import prot_gen 3 | import pandas as pd 4 | from collections import defaultdict, Counter 5 | import argparse 6 | import logging 7 | import ete3 8 | from ete3 import NCBITaxa 9 | ncbi = NCBITaxa() 10 | 11 | logger = logging.getLogger(__name__) 12 | 13 | def run(): 14 | parser = argparse.ArgumentParser( 15 | description='Combine DirectMS1 search results', 16 | epilog=''' 17 | 18 | Example usage 19 | ------------- 20 | $ ms1groups file1_PFMs_ML.tsv -d uniprot_shuffled.fasta -out group_statistics_by -fdr 5.0 -nproc 8 -groups genus 21 | ------------- 22 | ''', 23 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 24 | 25 | parser.add_argument('file', nargs='+', help='input tsv PFMs_ML files for union') 26 | parser.add_argument('-d', '-db', help='path to protein fasta file', required=True) 27 | parser.add_argument('-out', help='prefix output file names', default='group_specific_statistics_by_') 28 | parser.add_argument('-prots_full', help='path to any of *_proteins_full.tsv file. By default this file will be searched in the folder with PFMs_ML files', default='') 29 | parser.add_argument('-fdr', help='protein fdr filter in %%', default=1.0, type=float) 30 | parser.add_argument('-prefix', help='decoy prefix', default='DECOY_') 31 | parser.add_argument('-nproc', help='number of processes', default=1, type=int) 32 | parser.add_argument('-groups', help="dbname: To use taxonomy in swiss-prot protein name (_HUMAN, _YEAST, etc.). OX: Use OX= from fasta file. 
Or can be 'species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom', 'domain'", default='dbname') 33 | parser.add_argument('-pp', help='protein priority table for keeping protein groups when merge results by scoring', default='') 34 | args = vars(parser.parse_args()) 35 | logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s', 36 | datefmt='[%H:%M:%S]', level=logging.INFO) 37 | 38 | group_to_use = args['groups'] 39 | allowed_groups = [ 40 | 'dbname', 'OX', 'species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom', 'domain' 41 | ] 42 | if group_to_use not in allowed_groups: 43 | logging.critical('group is not correct! Must be: %s', ','.join(allowed_groups)) 44 | return -1 45 | 46 | dbname_map = dict() 47 | ox_map = dict() 48 | for dbinfo, dbseq in prot_gen(args): 49 | dbname = dbinfo.split(' ')[0] 50 | 51 | if group_to_use != 'dbname': 52 | try: 53 | ox = dbinfo.split('OX=')[-1].split(' ')[0] 54 | except: 55 | ox = 'Unknown' 56 | else: 57 | try: 58 | ox = dbinfo.split(' ')[0].split('|')[-1].split('_')[-1] 59 | except: 60 | ox = 'Unknown' 61 | dbname_map[dbname] = ox 62 | 63 | cnt = Counter(dbname_map.values()) 64 | 65 | if group_to_use not in ['dbname', 'OX']: 66 | for ox in cnt.keys(): 67 | 68 | line = ncbi.get_lineage(ox) 69 | ranks = ncbi.get_rank(line) 70 | if group_to_use not in ranks.values(): 71 | logger.warning('%s does not have %s', str(ox), group_to_use) 72 | group_custom = 'OX:' + ox 73 | # print('{} does not have {}'.format(i, group_to_use)) 74 | # continue 75 | 76 | else: 77 | ranks_rev = {k[1]:k[0] for k in ranks.items()} 78 | # print(ranks_rev) 79 | group_custom = ranks_rev[group_to_use] 80 | 81 | ox_map[ox] = group_custom 82 | 83 | 84 | for dbname in list(dbname_map.keys()): 85 | dbname_map[dbname] = ox_map[dbname_map[dbname]] 86 | 87 | cnt = Counter(dbname_map.values()) 88 | 89 | print(cnt.most_common()) 90 | 91 | # return -1 92 | 93 | d_tmp = dict() 94 | 95 | df1 = None 96 | for idx, filen in enumerate(args['file']): 97 | logging.info('Reading file %s' % (filen, )) 98 | df3 = pd.read_csv(filen, sep='\t', usecols=['ids', 'qpreds', 'preds', 'decoy', 'seqs', 'proteins', 'peptide', 'iorig']) 99 | df3['ids'] = df3['ids'].apply(lambda x: '%d:%s' % (idx, str(x))) 100 | df3['fidx'] = idx 101 | 102 | df3 = df3[df3['qpreds'] <= 10] 103 | 104 | 105 | qval_ok = 0 106 | for qval_cur in range(10): 107 | if qval_cur != 10: 108 | df1ut = df3[df3['qpreds'] == qval_cur] 109 | decoy_ratio = df1ut['decoy'].sum() / len(df1ut) 110 | d_tmp[(idx, qval_cur)] = decoy_ratio 111 | # print(filen, qval_cur, decoy_ratio) 112 | 113 | if df1 is None: 114 | df1 = df3 115 | if args['prots_full']: 116 | df2 = pd.read_csv(args['prots_full'], sep='\t') 117 | else: 118 | try: 119 | df2 = pd.read_csv(filen.replace('_PFMs_ML.tsv', '_proteins_full.tsv'), sep='\t') 120 | except: 121 | logging.critical('Proteins_full file is missing!') 122 | break 123 | 124 | else: 125 | df1 = pd.concat([df1, df3], ignore_index=True) 126 | 127 | d_tmp = [z[0] for z in sorted(d_tmp.items(), key=lambda x: x[1])] 128 | qdict = {} 129 | for idx, val in enumerate(d_tmp): 130 | qdict[val] = int(idx / len(args['file'])) 131 | df1['qpreds'] = df1.apply(lambda x: qdict[(x['fidx'], x['qpreds'])], axis=1) 132 | 133 | 134 | pept_prot = defaultdict(set) 135 | group_to_pep = defaultdict(set) 136 | for seq, prots in df1[['seqs', 'proteins']].values: 137 | for dbname in prots.split(';'): 138 | pept_prot[seq].add(dbname) 139 | group_to_pep[dbname_map[dbname]].add(seq) 140 | 141 | protsN = dict() 142 | for dbname, 
theorpept in df2[['dbname', 'theoretical peptides']].values:
143 |         protsN[dbname] = theorpept
144 | 
145 | 
146 |     prefix = args['prefix']
147 |     isdecoy = lambda x: x[0].startswith(prefix)
148 |     isdecoy_key = lambda x: x.startswith(prefix)
149 |     escore = lambda x: -x[1]
150 |     fdr = float(args['fdr']) / 100
151 | 
152 |     # all_proteins = []
153 | 
154 |     base_out_name = args['out'] + group_to_use + '.tsv'
155 | 
156 |     out_dict = dict()
157 | 
158 |     for group_name in cnt:
159 | 
160 |         logger.info(group_name)
161 | 
162 |         df1_tmp = df1[df1['peptide'].apply(lambda x: x in group_to_pep[group_name])]
163 | 
164 |         protsN_tmp = dict()
165 |         for k, v in protsN.items():
166 |             if dbname_map[k] == group_name:
167 |                 protsN_tmp[k] = v
168 | 
169 |         resdict = dict()
170 | 
171 |         resdict['qpreds'] = df1_tmp['qpreds'].values
172 |         resdict['preds'] = df1_tmp['preds'].values
173 |         resdict['seqs'] = df1_tmp['peptide'].values
174 |         resdict['ids'] = df1_tmp['ids'].values
175 |         resdict['iorig'] = df1_tmp['iorig'].values
176 | 
177 |         # mass_diff = resdict['qpreds']
178 |         # rt_diff = resdict['qpreds']
179 | 
180 |         e_ind = resdict['qpreds'] <= 9
181 |         resdict = filter_results(resdict, e_ind)
182 | 
183 |         mass_diff = resdict['qpreds']
184 |         rt_diff = resdict['qpreds']
185 | 
186 |         if args['pp']:
187 |             df4 = pd.read_table(args['pp'])
188 |             prots_spc_basic2 = df4.set_index('dbname')['score'].to_dict()
189 |         else:
190 |             prots_spc_basic2 = False
191 | 
192 | 
193 | 
194 |         top_proteins = final_iteration(resdict, mass_diff, rt_diff, pept_prot, protsN_tmp, base_out_name, prefix, isdecoy, isdecoy_key, escore, fdr, args['nproc'], prots_spc_basic2=prots_spc_basic2, output_all=False)
195 |         # all_proteins.extend(top_proteins)
196 |         out_dict[group_name] = len(top_proteins)
197 |         # print(top_proteins)
198 |         print('\n')
199 |         # break
200 | 
201 |     with open(base_out_name, 'w') as output:
202 |         output.write('taxid\tproteins\n')
203 |         for k, v in out_dict.items():
204 |             output.write('\t'.join((str(k), str(v))) + '\n')
205 | 
206 | 
207 | 
208 | if __name__ == '__main__':
209 |     run()
210 | 
--------------------------------------------------------------------------------
/ms1searchpy/main.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | from scipy.stats import scoreatpercentile, rankdata
4 | from scipy.optimize import curve_fit
5 | import operator
6 | from copy import copy, deepcopy
7 | from collections import defaultdict
8 | from pyteomics import parser, mass, auxiliary as aux, achrom
9 | try:
10 |     from pyteomics import cmass
11 | except ImportError:
12 |     cmass = mass
13 | import subprocess
14 | import tempfile
15 | import pandas as pd
16 | import matplotlib
17 | matplotlib.use('Agg')
18 | from multiprocessing import Queue, Process, cpu_count
19 | try:
20 |     import seaborn
21 |     seaborn.set(rc={'axes.facecolor':'#ffffff'})
22 |     seaborn.set_style('whitegrid')
23 | except:
24 |     pass
25 | 
26 | from . 
import utils 27 | from .utils_figures import plot_outfigures 28 | import lightgbm as lgb 29 | SEED = 50 30 | import warnings 31 | warnings.formatwarning = lambda msg, *args, **kw: str(msg) + '\n' 32 | 33 | import logging 34 | import numpy 35 | import pandas 36 | from sklearn import metrics 37 | import csv 38 | import ast 39 | 40 | 41 | logger = logging.getLogger(__name__) 42 | 43 | def worker_RT(qin, qout, shift, step, RC=False, ns=False, nr=False, win_sys=False): 44 | pepdict = dict() 45 | maxval = len(qin) 46 | start = 0 47 | while start + shift < maxval: 48 | item = qin[start+shift] 49 | pepdict[item] = achrom.calculate_RT(item, RC) 50 | start += step 51 | if win_sys: 52 | return pepdict 53 | else: 54 | qout.put(pepdict) 55 | qout.put(None) 56 | 57 | 58 | 59 | 60 | def calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=False, p=False): 61 | 62 | if best_base_results is not False: 63 | 64 | sorted_proteins_names = sorted(best_base_results.items(), key=lambda x: -x[1]) 65 | sorted_proteins_names_dict = {} 66 | for idx, k in enumerate(sorted_proteins_names): 67 | sorted_proteins_names_dict[k[0]] = idx 68 | else: 69 | sorted_proteins_names_dict = False 70 | 71 | 72 | prots_spc2 = defaultdict(set) 73 | for pep, proteins in pept_prot.items(): 74 | if pep in p1: 75 | if sorted_proteins_names_dict is not False: 76 | best_pos = min(sorted_proteins_names_dict[k] for k in proteins) 77 | for protein in proteins: 78 | if sorted_proteins_names_dict[protein] == best_pos: 79 | prots_spc2[protein].add(pep) 80 | else: 81 | for protein in proteins: 82 | prots_spc2[protein].add(pep) 83 | 84 | 85 | for k in protsN: 86 | if k not in prots_spc2: 87 | prots_spc2[k] = set([]) 88 | prots_spc = dict((k, len(v)) for k, v in prots_spc2.items()) 89 | 90 | names_arr = np.array(list(prots_spc.keys())) 91 | v_arr = np.array(list(prots_spc.values())) 92 | n_arr = np.array([protsN[k] for k in prots_spc]) 93 | 94 | if p is False: 95 | top100decoy_score = [prots_spc.get(dprot, 0) for dprot in protsN if isdecoy_key(dprot)] 96 | top100decoy_N = [val for key, val in protsN.items() if isdecoy_key(key)] 97 | p = np.mean(top100decoy_score) / np.mean(top100decoy_N) 98 | logger.info('Probability of random match for theoretical peptide = %.3f', (np.mean(top100decoy_score) / np.mean(top100decoy_N))) 99 | 100 | 101 | prots_spc = dict() 102 | all_pvals = utils.calc_sf_all(v_arr, n_arr, p) 103 | for idx, k in enumerate(names_arr): 104 | prots_spc[k] = all_pvals[idx] 105 | 106 | if best_base_results is not False: 107 | checked = set() 108 | for k, v in list(prots_spc.items()): 109 | if k not in checked: 110 | if isdecoy_key(k): 111 | if prots_spc.get(k.replace(prefix, ''), -1e6) > v: 112 | del prots_spc[k] 113 | checked.add(k.replace(prefix, '')) 114 | else: 115 | if prots_spc.get(prefix + k, -1e6) > v: 116 | del prots_spc[k] 117 | checked.add(prefix + k) 118 | 119 | return prots_spc, p 120 | 121 | 122 | 123 | 124 | def calibrate_RT_gaus_full(rt_diff_tmp): 125 | RT_left = -min(rt_diff_tmp) 126 | RT_right = max(rt_diff_tmp) 127 | 128 | try: 129 | start_width = (scoreatpercentile(rt_diff_tmp, 95) - scoreatpercentile(rt_diff_tmp, 5)) / 100 130 | XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus(start_width, RT_left, RT_right, rt_diff_tmp) 131 | except: 132 | start_width = (scoreatpercentile(rt_diff_tmp, 95) - scoreatpercentile(rt_diff_tmp, 5)) / 50 133 | XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus(start_width, RT_left, RT_right, rt_diff_tmp) 134 | if np.isinf(covvalue): 135 | XRT_shift, XRT_sigma, 
covvalue = calibrate_RT_gaus(0.1, RT_left, RT_right, rt_diff_tmp) 136 | if np.isinf(covvalue): 137 | XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus(1.0, RT_left, RT_right, rt_diff_tmp) 138 | return XRT_shift, XRT_sigma, covvalue 139 | 140 | def final_iteration(resdict, mass_diff, rt_diff, pept_prot, protsN, base_out_name, prefix, isdecoy, isdecoy_key, escore, fdr, nproc, out_log=False, fname=False, prots_spc_basic2=False, output_all=True, separate_figures=False): 141 | n = nproc 142 | prots_spc_basic = dict() 143 | 144 | p1 = set(resdict['seqs']) 145 | 146 | pep_pid = defaultdict(set) 147 | pid_pep = defaultdict(set) 148 | banned_dict = dict() 149 | for pep, pid in zip(resdict['seqs'], resdict['ids']): 150 | 151 | pep_pid[pep].add(pid) 152 | pid_pep[pid].add(pep) 153 | if pep in banned_dict: 154 | banned_dict[pep] += 1 155 | else: 156 | banned_dict[pep] = 1 157 | 158 | if len(p1): 159 | prots_spc_final = dict() 160 | prots_spc_copy = False 161 | prots_spc2 = False 162 | unstable_prots = set() 163 | p0 = False 164 | 165 | names_arr = False 166 | tmp_spc_new = False 167 | decoy_set = False 168 | 169 | cnt_target_final = 0 170 | cnt_decoy_final = 0 171 | 172 | while 1: 173 | if not prots_spc2: 174 | 175 | best_match_dict = dict() 176 | n_map_dict = defaultdict(list) 177 | for k, v in protsN.items(): 178 | n_map_dict[v].append(k) 179 | 180 | decoy_set = set() 181 | for k in protsN: 182 | if isdecoy_key(k): 183 | decoy_set.add(k) 184 | # decoy_set = list(decoy_set) 185 | 186 | 187 | prots_spc2 = defaultdict(set) 188 | for pep, proteins in pept_prot.items(): 189 | if pep in p1: 190 | for protein in proteins: 191 | prots_spc2[protein].add(pep) 192 | 193 | for k in protsN: 194 | if k not in prots_spc2: 195 | prots_spc2[k] = set([]) 196 | prots_spc2 = dict(prots_spc2) 197 | unstable_prots = set(prots_spc2.keys()) 198 | 199 | top100decoy_N = sum([val for key, val in protsN.items() if isdecoy_key(key)]) 200 | 201 | names_arr = np.array(list(prots_spc2.keys())) 202 | n_arr = np.array([protsN[k] for k in names_arr]) 203 | 204 | tmp_spc_new = dict((k, len(v)) for k, v in prots_spc2.items()) 205 | 206 | 207 | # top100decoy_score_tmp = [tmp_spc_new.get(dprot, 0) for dprot in decoy_set] 208 | # top100decoy_score_tmp_sum = float(sum(top100decoy_score_tmp)) 209 | top100decoy_score_tmp = {dprot: tmp_spc_new.get(dprot, 0) for dprot in decoy_set} 210 | top100decoy_score_tmp_sum = float(sum(top100decoy_score_tmp.values())) 211 | 212 | prots_spc = tmp_spc_new 213 | if not prots_spc_copy: 214 | prots_spc_copy = deepcopy(prots_spc) 215 | 216 | for v in decoy_set.intersection(unstable_prots): 217 | top100decoy_score_tmp_sum -= top100decoy_score_tmp[v] 218 | top100decoy_score_tmp[v] = prots_spc.get(v, 0) 219 | top100decoy_score_tmp_sum += top100decoy_score_tmp[v] 220 | p = top100decoy_score_tmp_sum / top100decoy_N 221 | 222 | n_change = set(protsN[k] for k in unstable_prots) 223 | if len(best_match_dict) == 0: 224 | n_change = sorted(n_change) 225 | for n_val in n_change: 226 | for k in n_map_dict[n_val]: 227 | v = prots_spc[k] 228 | if n_val not in best_match_dict or v > prots_spc[best_match_dict[n_val]]: 229 | best_match_dict[n_val] = k 230 | n_arr_small = [] 231 | names_arr_small = [] 232 | v_arr_small = [] 233 | # for k, v in best_match_dict.items(): 234 | # n_arr_small.append(k) 235 | # names_arr_small.append(v) 236 | # v_arr_small.append(prots_spc[v]) 237 | 238 | 239 | 240 | max_k = 0 241 | 242 | for k, v in best_match_dict.items(): 243 | num_matched = prots_spc[v] 244 | if num_matched >= max_k: 245 | 246 | 
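# best_match_dict maps each theoretical-peptide count n to its best-scoring
# protein. On the first pass n_change is sorted ascending and dicts preserve
# insertion order, so this loop walks roughly in increasing n and keeps only
# proteins whose matched-peptide count is a running maximum, i.e. an upper
# envelope of (theoretical peptides, matched peptides) pairs that is then
# scored with utils.calc_sf_all. A plausible reading of that score, as a
# sketch under an assumed binomial null model (not the actual implementation):
#
#     from scipy.stats import binom
#     def calc_sf_all_sketch(v_arr, n_arr, p):
#         # chance of >= v random matches among n theoretical peptides
#         return -binom.logsf(v_arr - 1, n_arr, p)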
max_k = num_matched 247 | 248 | n_arr_small.append(k) 249 | names_arr_small.append(v) 250 | v_arr_small.append(prots_spc[v]) 251 | 252 | 253 | prots_spc_basic = dict() 254 | all_pvals = utils.calc_sf_all(np.array(v_arr_small), n_arr_small, p) 255 | for idx, k in enumerate(names_arr_small): 256 | prots_spc_basic[k] = all_pvals[idx] 257 | 258 | if not p0: 259 | p0 = float(p) 260 | 261 | prots_spc_tmp = dict() 262 | v_arr = np.array([prots_spc[k] for k in names_arr]) 263 | all_pvals = utils.calc_sf_all(v_arr, n_arr, p) 264 | for idx, k in enumerate(names_arr): 265 | prots_spc_tmp[k] = all_pvals[idx] 266 | 267 | sortedlist_spc = sorted(prots_spc_tmp.items(), key=operator.itemgetter(1))[::-1] 268 | if output_all: 269 | with open(base_out_name + '_proteins_full_noexclusion.tsv', 'w') as output: 270 | output.write('dbname\tscore\tmatched peptides\ttheoretical peptides\n') 271 | for x in sortedlist_spc: 272 | output.write('\t'.join((x[0], str(x[1]), str(prots_spc_copy[x[0]]), str(protsN[x[0]]))) + '\n') 273 | 274 | best_prot = utils.keywithmaxval(prots_spc_basic) 275 | 276 | best_score = prots_spc_basic[best_prot] 277 | unstable_prots = set() 278 | if best_prot not in prots_spc_final: 279 | prots_spc_final[best_prot] = best_score 280 | banned_pids = set() 281 | for pep in prots_spc2[best_prot]: 282 | for pid in pep_pid[pep]: 283 | banned_pids.add(pid) 284 | for pid in banned_pids: 285 | for pep in pid_pep[pid]: 286 | banned_dict[pep] -= 1 287 | if banned_dict[pep] == 0: 288 | for bprot in pept_prot[pep]: 289 | tmp_spc_new[bprot] -= 1 290 | unstable_prots.add(bprot) 291 | 292 | 293 | if best_prot in decoy_set: 294 | cnt_decoy_final += 1 295 | else: 296 | cnt_target_final += 1 297 | 298 | 299 | else: 300 | 301 | v_arr = np.array([prots_spc[k] for k in names_arr]) 302 | all_pvals = utils.calc_sf_all(v_arr, n_arr, p) 303 | for idx, k in enumerate(names_arr): 304 | prots_spc_basic[k] = all_pvals[idx] 305 | 306 | for k, v in prots_spc_basic.items(): 307 | if k not in prots_spc_final: 308 | prots_spc_final[k] = v 309 | 310 | break 311 | try: 312 | prot_fdr = cnt_decoy_final / cnt_target_final 313 | # prot_fdr = aux.fdr(prots_spc_final.items(), is_decoy=isdecoy) 314 | except: 315 | prot_fdr = 100.0 316 | if prot_fdr >= 12.5 * fdr: 317 | 318 | v_arr = np.array([prots_spc[k] for k in names_arr]) 319 | all_pvals = utils.calc_sf_all(v_arr, n_arr, p) 320 | for idx, k in enumerate(names_arr): 321 | prots_spc_basic[k] = all_pvals[idx] 322 | 323 | for k, v in prots_spc_basic.items(): 324 | if k not in prots_spc_final: 325 | prots_spc_final[k] = v 326 | break 327 | 328 | if prots_spc_basic2 is False: 329 | prots_spc_basic2 = copy(prots_spc_final) 330 | else: 331 | prots_spc_basic2 = prots_spc_basic2 332 | for k in prots_spc_final: 333 | if k not in prots_spc_basic2: 334 | prots_spc_basic2[k] = 0 335 | prots_spc_final = dict() 336 | 337 | if n == 0: 338 | try: 339 | n = cpu_count() 340 | except NotImplementedError: 341 | n = 1 342 | 343 | if n == 1 or os.name == 'nt': 344 | qin = [] 345 | qout = [] 346 | for mass_koef in range(10): 347 | rtt_koef = mass_koef 348 | qin.append((mass_koef, rtt_koef)) 349 | qout = worker(qin, qout, mass_diff, rt_diff, resdict, protsN, pept_prot, isdecoy_key, isdecoy, fdr, prots_spc_basic2, True) 350 | 351 | for item, item2 in qout: 352 | if item2: 353 | prots_spc_copy = item2 354 | for k in protsN: 355 | if k not in prots_spc_final: 356 | prots_spc_final[k] = [item.get(k, 0.0), ] 357 | else: 358 | prots_spc_final[k].append(item.get(k, 0.0)) 359 | 360 | else: 361 | qin = Queue() 362 | 
qout = Queue() 363 | 364 | for mass_koef in range(10): 365 | rtt_koef = mass_koef 366 | qin.put((mass_koef, rtt_koef)) 367 | 368 | for _ in range(n): 369 | qin.put(None) 370 | 371 | procs = [] 372 | for proc_num in range(n): 373 | p = Process(target=worker, args=(qin, qout, mass_diff, rt_diff, resdict, protsN, pept_prot, isdecoy_key, isdecoy, fdr, prots_spc_basic2)) 374 | p.start() 375 | procs.append(p) 376 | 377 | for _ in range(n): 378 | for item, item2 in iter(qout.get, None): 379 | if item2: 380 | prots_spc_copy = item2 381 | for k in protsN: 382 | if k not in prots_spc_final: 383 | prots_spc_final[k] = [item.get(k, 0.0), ] 384 | else: 385 | prots_spc_final[k].append(item.get(k, 0.0)) 386 | 387 | for p in procs: 388 | p.join() 389 | 390 | for k in prots_spc_final.keys(): 391 | prots_spc_final[k] = np.mean(prots_spc_final[k]) 392 | 393 | prots_spc = deepcopy(prots_spc_final) 394 | sortedlist_spc = sorted(prots_spc.items(), key=operator.itemgetter(1))[::-1] 395 | if output_all: 396 | with open(base_out_name + '_proteins_full.tsv', 'w') as output: 397 | output.write('dbname\tscore\tmatched peptides\ttheoretical peptides\tdecoy\n') 398 | for x in sortedlist_spc: 399 | output.write('\t'.join((x[0], str(x[1]), str(prots_spc_copy[x[0]]), str(protsN[x[0]]), str(isdecoy(x)))) + '\n') 400 | 401 | checked = set() 402 | for k, v in list(prots_spc.items()): 403 | if k not in checked: 404 | if isdecoy_key(k): 405 | if prots_spc.get(k.replace(prefix, ''), -1e6) > v: 406 | del prots_spc[k] 407 | checked.add(k.replace(prefix, '')) 408 | else: 409 | if prots_spc.get(prefix + k, -1e6) > v: 410 | del prots_spc[k] 411 | checked.add(prefix + k) 412 | 413 | filtered_prots = aux.filter(prots_spc.items(), fdr=fdr, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1, full_output=True, correction=1) 414 | if len(filtered_prots) < 1: 415 | filtered_prots = aux.filter(prots_spc.items(), fdr=fdr, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1, full_output=True, correction=0) 416 | identified_proteins = 0 417 | 418 | for x in filtered_prots: 419 | identified_proteins += 1 420 | 421 | logger.info('TOP 5 identified proteins:') 422 | logger.info('dbname\tscore\tmatched peptides\ttheoretical peptides') 423 | for x in filtered_prots[:5]: 424 | logger.info('\t'.join((str(x[0]), str(x[1]), str(int(prots_spc_copy[x[0]])), str(protsN[x[0]])))) 425 | logger.info('Final stage search: identified proteins = %d', identified_proteins) 426 | 427 | if output_all: 428 | with open(base_out_name + '_proteins.tsv', 'w') as output: 429 | output.write('dbname\tscore\tmatched peptides\ttheoretical peptides\tdecoy\n') 430 | for x in filtered_prots: 431 | output.write('\t'.join((x[0], str(x[1]), str(prots_spc_copy[x[0]]), str(protsN[x[0]]), str(isdecoy(x)))) + '\n') 432 | 433 | if out_log or (fname and identified_proteins > 10): 434 | df1_peptides = pd.read_table(base_out_name + '_PFMs.tsv') 435 | df1_peptides['decoy'] = df1_peptides['proteins'].apply(lambda x: any(isdecoy_key(z) for z in x.split(';'))) 436 | 437 | df1_proteins = pd.read_table(base_out_name + '_proteins_full.tsv') 438 | df1_proteins_f = pd.read_table(base_out_name + '_proteins.tsv') 439 | top_proteins = set(df1_proteins_f['dbname']) 440 | df1_peptides_f = df1_peptides[df1_peptides['proteins'].apply(lambda x: any(z in top_proteins for z in x.split(';')))] 441 | 442 | if fname and identified_proteins > 10: 443 | 444 | df0 = pd.read_table(os.path.splitext(fname)[0] + '.tsv') 445 | 446 | plot_outfigures(df0, df1_peptides, df1_peptides_f, 447 | base_out_name, 
df_proteins=df1_proteins, 448 | df_proteins_f=df1_proteins_f, prefix=prefix, separate_figures=separate_figures) 449 | 450 | if out_log is not False: 451 | df1_peptides_f = df1_peptides_f.sort_values(by='Intensity') 452 | df1_peptides_f = df1_peptides_f.drop_duplicates(subset='sequence', keep='last') 453 | dynamic_range_estimation = np.log10(df1_peptides_f['Intensity'].quantile(0.99)) - np.log10(df1_peptides_f['Intensity'].quantile(0.01)) 454 | out_log.write('Estimated dynamic range in Log10 scale: %.1f\n' % (dynamic_range_estimation, )) 455 | out_log.write('Matched peptides for top-scored proteins: %d\n' % (len(df1_peptides_f), )) 456 | out_log.write('Identified proteins: %d\n' % (identified_proteins, )) 457 | out_log.close() 458 | 459 | return top_proteins 460 | 461 | else: 462 | top_proteins_fullinfo = [] 463 | for x in filtered_prots: 464 | top_proteins_fullinfo.append((x[0], str(x[1]), str(prots_spc_copy[x[0]]), str(protsN[x[0]]), str(isdecoy(x)))) 465 | 466 | 467 | return top_proteins_fullinfo 468 | 469 | def noisygaus(x, a, x0, sigma, b): 470 | return a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2)) + b 471 | 472 | def calibrate_mass(bwidth, mass_left, mass_right, true_md): 473 | 474 | bbins = np.arange(-mass_left, mass_right, bwidth) 475 | H1, b1 = np.histogram(true_md, bins=bbins) 476 | b1 = b1 + bwidth 477 | b1 = b1[:-1] 478 | 479 | 480 | popt, pcov = curve_fit(noisygaus, b1, H1, p0=[1, np.median(true_md), 1, 1]) 481 | mass_shift, mass_sigma = popt[1], abs(popt[2]) 482 | return mass_shift, mass_sigma, pcov[0][0] 483 | 484 | def calibrate_RT_gaus(bwidth, mass_left, mass_right, true_md): 485 | 486 | bbins = np.arange(-mass_left, mass_right, bwidth) 487 | H1, b1 = np.histogram(true_md, bins=bbins) 488 | b1 = b1 + bwidth 489 | b1 = b1[:-1] 490 | 491 | 492 | popt, pcov = curve_fit(noisygaus, b1, H1, p0=[1, np.median(true_md), bwidth * 5, 1]) 493 | mass_shift, mass_sigma = popt[1], abs(popt[2]) 494 | return mass_shift, mass_sigma, pcov[0][0] 495 | 496 | def process_file(args): 497 | utils.seen_target.clear() 498 | utils.seen_decoy.clear() 499 | args = utils.prepare_decoy_db(args) 500 | for filename in args['files']: 501 | 502 | # Temporary for pyteomics <= Version 4.5.5 bug 503 | from pyteomics import mass 504 | if 'H-' in mass.std_aa_mass: 505 | del mass.std_aa_mass['H-'] 506 | if '-OH' in mass.std_aa_mass: 507 | del mass.std_aa_mass['-OH'] 508 | 509 | try: 510 | args['file'] = filename 511 | process_peptides(deepcopy(args)) 512 | except Exception as e: 513 | logger.error(e) 514 | logger.error('Search is failed for file: %s', filename) 515 | return 1 516 | 517 | 518 | 519 | def prepare_peptide_processor(fname, args): 520 | global nmasses 521 | global rts 522 | global charges 523 | global ids 524 | global Is 525 | global Isums 526 | global Scans 527 | global Isotopes 528 | global mzraw 529 | global avraw 530 | global imraw 531 | 532 | min_ch = args['cmin'] 533 | max_ch = args['cmax'] 534 | 535 | min_isotopes = args['i'] 536 | min_scans = args['sc'] 537 | 538 | logger.info('Reading file %s', fname) 539 | 540 | df_features = utils.iterate_spectra(fname, min_ch, max_ch, min_isotopes, min_scans, args['nproc'], args['check_unique']) 541 | 542 | # Sort by neutral mass 543 | df_features = df_features.sort_values(by='massCalib') 544 | 545 | nmasses = df_features['massCalib'].values 546 | rts = df_features['rtApex'].values 547 | charges = df_features['charge'].values 548 | ids = df_features['id'].values 549 | Is = df_features['intensityApex'].values 550 | if 'intensitySum' in df_features.columns: 
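# df_features is expected to provide the columns read in this function:
# 'massCalib', 'rtApex', 'charge', 'id', 'intensityApex', 'nScans',
# 'nIsotopes', 'mz' and an ion-mobility column ('FAIMS' and/or 'im');
# 'intensitySum' is optional, with the intensityApex fallback handled just
# below. A toy single-row table satisfying these expectations (values are
# illustrative only):
#
#     import pandas as pd
#     minimal = pd.DataFrame({'massCalib': [1000.5], 'rtApex': [12.3],
#                             'charge': [2], 'id': [0], 'intensityApex': [1e6],
#                             'nScans': [5], 'nIsotopes': [3], 'mz': [501.26],
#                             'im': [0.0], 'FAIMS': [0.0]})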
551 | Isums = df_features['intensitySum'].values 552 | else: 553 | Isums = df_features['intensityApex'].values 554 | logger.info('intensitySum column is missing in peptide features. Using intensityApex instead') 555 | 556 | Scans = df_features['nScans'].values 557 | Isotopes = df_features['nIsotopes'].values 558 | mzraw = df_features['mz'].values 559 | avraw = np.zeros(len(df_features)) 560 | if len(set(df_features['FAIMS'])) > 1: 561 | imraw = df_features['FAIMS'].values 562 | else: 563 | imraw = df_features['im'].values 564 | 565 | logger.info('Number of peptide isotopic clusters passed filters: %d', len(nmasses)) 566 | 567 | aa_mass, aa_to_psi = utils.get_aa_mass_with_fixed_mods(args['fmods'], args['fmods_legend']) 568 | 569 | acc_l = args['ptol'] 570 | acc_r = args['ptol'] 571 | 572 | return {'aa_mass': aa_mass, 'aa_to_psi': aa_to_psi, 'acc_l': acc_l, 'acc_r': acc_r, 'args': args}, df_features 573 | 574 | def get_resdict(it, **kwargs): 575 | 576 | resdict = { 577 | 'seqs': [], 578 | 'md': [], 579 | 'mods': [], 580 | 'iorig': [], 581 | } 582 | 583 | for seqm in it: 584 | results = [] 585 | m = cmass.fast_mass(seqm, aa_mass=kwargs['aa_mass']) + kwargs['aa_mass'].get('Nterm', 0) + kwargs['aa_mass'].get('Cterm', 0) 586 | acc_l = kwargs['acc_l'] 587 | acc_r = kwargs['acc_r'] 588 | dm_l = acc_l * m / 1.0e6 589 | if acc_r == acc_l: 590 | dm_r = dm_l 591 | else: 592 | dm_r = acc_r * m / 1.0e6 593 | start = nmasses.searchsorted(m - dm_l) 594 | end = nmasses.searchsorted(m + dm_r, side='right') 595 | for i in range(start, end): 596 | massdiff = (m - nmasses[i]) / m * 1e6 597 | mods = 0 598 | 599 | resdict['seqs'].append(seqm) 600 | resdict['md'].append(massdiff) 601 | resdict['mods'].append(mods) 602 | resdict['iorig'].append(i) 603 | 604 | 605 | for k in list(resdict.keys()): 606 | resdict[k] = np.array(resdict[k]) 607 | 608 | return resdict 609 | 610 | 611 | 612 | def filter_results(resultdict, idx): 613 | tmp = dict() 614 | for label in resultdict: 615 | tmp[label] = resultdict[label][idx] 616 | return tmp 617 | 618 | def process_peptides(args): 619 | logger.info('Starting search...') 620 | 621 | fname_orig = args['file'] 622 | if fname_orig.lower().endswith('mzml'): 623 | fname = os.path.splitext(fname_orig)[0] + '.features.tsv' 624 | else: 625 | fname = fname_orig 626 | 627 | fdr = args['fdr'] / 100 628 | min_isotopes_calibration = args['ci'] 629 | min_scans_calibration = args['csc'] 630 | try: 631 | outpath = args['o'] 632 | except: 633 | outpath = False 634 | 635 | 636 | if outpath: 637 | base_out_name = os.path.splitext(os.path.join(outpath, os.path.basename(fname)))[0] 638 | else: 639 | base_out_name = os.path.splitext(fname)[0] 640 | 641 | out_log = open(base_out_name + '_log.txt', 'w') 642 | out_log.close() 643 | out_log = open(base_out_name + '_log.txt', 'w') 644 | out_log.write('Filename: %s\n' % (fname, )) 645 | out_log.write('Used parameters: %s\n' % (args.items(), )) 646 | 647 | deeplc_path = args['deeplc'] 648 | if deeplc_path: 649 | from deeplc import DeepLC 650 | logging.getLogger('deeplc').setLevel(logging.ERROR) 651 | 652 | deeplc_model_path = args['deeplc_model_path'] 653 | deeplc_model_path = deeplc_model_path.strip() 654 | 655 | if len(deeplc_model_path) > 0: 656 | if deeplc_model_path.endswith('.hdf5'): 657 | path_model = deeplc_model_path 658 | else: 659 | path_model = [os.path.join(deeplc_model_path,f) for f in os.listdir(deeplc_model_path) if f.endswith(".hdf5")] 660 | 661 | else: 662 | path_model = None 663 | 664 | if args['deeplc_library']: 665 | 666 | path_to_lib = 
args['deeplc_library'] 667 | if not os.path.isfile(path_to_lib): 668 | lib_file = open(path_to_lib, 'w') 669 | lib_file.close() 670 | write_library = True 671 | 672 | else: 673 | path_to_lib = None 674 | write_library = False 675 | 676 | 677 | 678 | calib_path = args['pl'] 679 | calib_path = calib_path.strip() 680 | 681 | if calib_path and args['ts']: 682 | args['ts'] = 0 683 | logger.info('Two-stage RT prediction does not work with list of MS/MS identified peptides...') 684 | 685 | args['enzyme'] = utils.get_enzyme(args['e']) 686 | 687 | 688 | prefix = args['prefix'] 689 | protsN, pept_prot, ml_correction = utils.get_prot_pept_map(args) 690 | 691 | kwargs, df_features = prepare_peptide_processor(fname_orig, args) 692 | logger.info('Running the search ...') 693 | out_log.write('Number of features: %d\n' % (len(df_features), )) 694 | 695 | resdict = get_resdict(pept_prot, **kwargs) 696 | 697 | aa_to_psi = kwargs['aa_to_psi'] 698 | 699 | if args['mc'] > 0: 700 | resdict['mc'] = np.array([parser.num_sites(z, args['enzyme']) for z in resdict['seqs']]) 701 | 702 | isdecoy = lambda x: x[0].startswith(prefix) 703 | isdecoy_key = lambda x: x.startswith(prefix) 704 | escore = lambda x: -x[1] 705 | 706 | 707 | 708 | e_ind = np.array([Isotopes[iorig] for iorig in resdict['iorig']]) >= min_isotopes_calibration 709 | resdict2 = filter_results(resdict, e_ind) 710 | 711 | e_ind = np.array([Scans[iorig] for iorig in resdict2['iorig']]) >= min_scans_calibration 712 | resdict2 = filter_results(resdict2, e_ind) 713 | 714 | e_ind = resdict2['mods'] == 0 715 | resdict2 = filter_results(resdict2, e_ind) 716 | 717 | if args['mc'] > 0: 718 | e_ind = resdict2['mc'] == 0 719 | resdict2 = filter_results(resdict2, e_ind) 720 | 721 | p1 = set(resdict2['seqs']) 722 | 723 | if len(p1): 724 | 725 | # Calculate basic protein scores including homologues 726 | prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=False, p=False) 727 | # Calculate basic protein scores excluding homologues 728 | prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=prots_spc, p=p) 729 | 730 | 731 | 732 | 733 | # prots_spc2 = defaultdict(set) 734 | # for pep, proteins in pept_prot.items(): 735 | # if pep in p1: 736 | # for protein in proteins: 737 | # prots_spc2[protein].add(pep) 738 | 739 | # for k in protsN: 740 | # if k not in prots_spc2: 741 | # prots_spc2[k] = set([]) 742 | # prots_spc = dict((k, len(v)) for k, v in prots_spc2.items()) 743 | 744 | # names_arr = np.array(list(prots_spc.keys())) 745 | # v_arr = np.array(list(prots_spc.values())) 746 | # n_arr = np.array([protsN[k] for k in prots_spc]) 747 | 748 | # top100decoy_score = [prots_spc.get(dprot, 0) for dprot in protsN if isdecoy_key(dprot)] 749 | # top100decoy_N = [val for key, val in protsN.items() if isdecoy_key(key)] 750 | # p = np.mean(top100decoy_score) / np.mean(top100decoy_N) 751 | # logger.info('Stage 0 search: probability of random match for theoretical peptide = %.3f', (np.mean(top100decoy_score) / np.mean(top100decoy_N))) 752 | 753 | # prots_spc = dict() 754 | # all_pvals = utils.calc_sf_all(v_arr, n_arr, p) 755 | # for idx, k in enumerate(names_arr): 756 | # prots_spc[k] = all_pvals[idx] 757 | 758 | # # checked = set() 759 | # # for k, v in list(prots_spc.items()): 760 | # # if k not in checked: 761 | # # if isdecoy_key(k): 762 | # # if prots_spc.get(k.replace(prefix, ''), -1e6) > v: 763 | # # del prots_spc[k] 764 | # # checked.add(k.replace(prefix, '')) 765 | # # else: 766 | # # if 
prots_spc.get(prefix + k, -1e6) > v: 767 | # # del prots_spc[k] 768 | # # checked.add(prefix + k) 769 | 770 | 771 | 772 | 773 | 774 | # for k in protsN: 775 | # if k not in prots_spc: 776 | # prots_spc[k] = 0 777 | 778 | # sorted_proteins_names = sorted(prots_spc.items(), key=lambda x: -x[1]) 779 | # sorted_proteins_names_dict = {} 780 | # for idx, k in enumerate(sorted_proteins_names): 781 | # sorted_proteins_names_dict[k[0]] = idx 782 | 783 | # prots_spc2 = defaultdict(set) 784 | # for pep, proteins in pept_prot.items(): 785 | # if pep in p1: 786 | # best_pos = min(sorted_proteins_names_dict[k] for k in proteins) 787 | # for protein in proteins: 788 | # if sorted_proteins_names_dict[protein] == best_pos: 789 | # prots_spc2[protein].add(pep) 790 | 791 | # for k in protsN: 792 | # if k not in prots_spc2: 793 | # prots_spc2[k] = set([]) 794 | # prots_spc = dict((k, len(v)) for k, v in prots_spc2.items()) 795 | 796 | # names_arr = np.array(list(prots_spc.keys())) 797 | # v_arr = np.array(list(prots_spc.values())) 798 | # n_arr = np.array([protsN[k] for k in prots_spc]) 799 | 800 | # prots_spc = dict() 801 | # all_pvals = utils.calc_sf_all(v_arr, n_arr, p) 802 | # for idx, k in enumerate(names_arr): 803 | # prots_spc[k] = all_pvals[idx] 804 | 805 | # checked = set() 806 | # for k, v in list(prots_spc.items()): 807 | # if k not in checked: 808 | # if isdecoy_key(k): 809 | # if prots_spc.get(k.replace(prefix, ''), -1e6) > v: 810 | # del prots_spc[k] 811 | # checked.add(k.replace(prefix, '')) 812 | # else: 813 | # if prots_spc.get(prefix + k, -1e6) > v: 814 | # del prots_spc[k] 815 | # checked.add(prefix + k) 816 | 817 | 818 | 819 | 820 | 821 | 822 | 823 | 824 | 825 | 826 | 827 | 828 | 829 | 830 | 831 | 832 | 833 | 834 | 835 | filtered_prots = aux.filter(prots_spc.items(), fdr=0.05, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1, 836 | full_output=True) 837 | 838 | identified_proteins = 0 839 | 840 | for x in filtered_prots: 841 | identified_proteins += 1 842 | logger.info('Stage 0 search: identified proteins = %d', identified_proteins) 843 | if identified_proteins <= 25: 844 | logger.info('Low number of identified proteins, using first 25 top scored proteins for calibration...') 845 | filtered_prots = sorted(prots_spc.items(), key=lambda x: -x[1])[:25] 846 | 847 | logger.info('Running mass recalibration...') 848 | 849 | df1 = pd.DataFrame() 850 | df1['mass diff'] = resdict['md'] 851 | df1['mc'] = (resdict['mc'] if args['mc'] > 0 else 0) 852 | df1['iorig'] = resdict['iorig'] 853 | df1['seqs'] = resdict['seqs'] 854 | 855 | df1['mods'] = resdict['mods'] 856 | 857 | df1['nIsotopes'] = [Isotopes[iorig] for iorig in df1['iorig'].values] 858 | df1['nScans'] = [Scans[iorig] for iorig in df1['iorig'].values] 859 | 860 | # df1['orig_md'] = true_md 861 | 862 | 863 | true_seqs = set() 864 | true_prots = set(x[0] for x in filtered_prots) 865 | for pep, proteins in pept_prot.items(): 866 | if any(protein in true_prots for protein in proteins): 867 | true_seqs.add(pep) 868 | 869 | df1['top_peps'] = (df1['mc'] == 0) & (df1['seqs'].apply(lambda x: x in true_seqs) & (df1['nIsotopes'] >= min_isotopes_calibration) & (df1['nScans'] >= min_scans_calibration)) 870 | 871 | mass_calib_arg = args['mcalib'] 872 | 873 | assert mass_calib_arg in [0, 1, 2] 874 | 875 | if mass_calib_arg: 876 | df1['RT'] = rts[df1['iorig'].values]#df1['iorig'].apply(lambda x: rts[x]) 877 | 878 | if mass_calib_arg == 2: 879 | df1['im'] = imraw[df1['iorig'].values]#df1['iorig'].apply(lambda x: imraw[x]) 880 | elif mass_calib_arg == 
1: 881 | df1['im'] = 0 882 | 883 | im_set = set(df1['im'].unique()) 884 | if len(im_set) <= 5: 885 | df1['im_qcut'] = df1['im'] 886 | for im_value in im_set: 887 | idx1 = df1['im'] == im_value 888 | df1.loc[idx1, 'qpreds'] = str(im_value) + pd.qcut(df1.loc[idx1, 'RT'], 5, labels=range(5)).astype(str) 889 | else: 890 | df1['im_qcut'] = pd.qcut(df1['im'], 5, labels=range(5)).astype(str) 891 | im_set = set(df1['im_qcut'].unique()) 892 | for im_value in set(df1['im_qcut'].unique()): 893 | idx1 = df1['im_qcut'] == im_value 894 | df1.loc[idx1, 'qpreds'] = str(im_value) + pd.qcut(df1.loc[idx1, 'RT'], 5, labels=range(5)).astype(str) 895 | 896 | # df1['qpreds'] = pd.qcut(df1['RT'], 10, labels=range(10))#.astype(int) 897 | 898 | cor_dict = df1[df1['top_peps']].groupby('qpreds')['mass diff'].median().to_dict() 899 | 900 | rt_q_list = list(range(5)) 901 | for im_value in im_set: 902 | for rt_q in rt_q_list: 903 | lbl_cur = str(im_value) + str(rt_q) 904 | if lbl_cur not in cor_dict: 905 | 906 | best_diff = 1e6 907 | best_val = 0 908 | for rt_q2 in rt_q_list: 909 | cur_diff = abs(rt_q - rt_q2) 910 | if cur_diff != 0: 911 | lbl_cur2 = str(im_value) + str(rt_q2) 912 | if lbl_cur2 in cor_dict: 913 | if cur_diff < best_diff: 914 | best_diff = cur_diff 915 | best_val = cor_dict[lbl_cur2] 916 | 917 | cor_dict[lbl_cur] = best_val 918 | 919 | df1['mass diff q median'] = df1['qpreds'].apply(lambda x: cor_dict[x]) 920 | df1['mass diff corrected'] = df1['mass diff'] - df1['mass diff q median'] 921 | 922 | else: 923 | df1['qpreds'] = 0 924 | df1['mass diff q median'] = 0 925 | df1['mass diff corrected'] = df1['mass diff'] 926 | 927 | 928 | 929 | 930 | mass_left = args['ptol'] 931 | mass_right = args['ptol'] 932 | 933 | try: 934 | mass_shift_cor, mass_sigma_cor, covvalue_cor = calibrate_mass(0.001, mass_left, mass_right, df1[df1['top_peps']]['mass diff corrected']) 935 | except: 936 | mass_shift_cor, mass_sigma_cor, covvalue_cor = calibrate_mass(0.01, mass_left, mass_right, df1[df1['top_peps']]['mass diff corrected']) 937 | 938 | if mass_calib_arg: 939 | 940 | try: 941 | mass_shift, mass_sigma, covvalue = calibrate_mass(0.001, mass_left, mass_right, df1[df1['top_peps']]['mass diff']) 942 | except: 943 | mass_shift, mass_sigma, covvalue = calibrate_mass(0.01, mass_left, mass_right, df1[df1['top_peps']]['mass diff']) 944 | 945 | logger.info('Uncalibrated mass shift: %.3f ppm', mass_shift) 946 | logger.info('Uncalibrated mass sigma: %.3f ppm', mass_sigma) 947 | 948 | logger.info('Estimated mass shift: %.3f ppm', mass_shift_cor) 949 | logger.info('Estimated mass sigma: %.3f ppm', mass_sigma_cor) 950 | 951 | out_log.write('Estimated mass shift: %.3f ppm\n' % (mass_shift_cor, )) 952 | out_log.write('Estimated mass sigma: %.3f ppm\n' % (mass_sigma_cor, )) 953 | 954 | resdict['md'] = df1['mass diff corrected'].values 955 | 956 | mass_shift = mass_shift_cor 957 | mass_sigma = mass_sigma_cor 958 | 959 | e_all = abs(resdict['md'] - mass_shift) / (mass_sigma) 960 | r = 3.0 961 | e_ind = e_all <= r 962 | resdict = filter_results(resdict, e_ind) 963 | 964 | 965 | e_ind = np.array([Isotopes[iorig] for iorig in resdict['iorig']]) >= min_isotopes_calibration 966 | resdict2 = filter_results(resdict, e_ind) 967 | 968 | 969 | e_ind = np.array([Scans[iorig] for iorig in resdict2['iorig']]) >= min_scans_calibration 970 | resdict2 = filter_results(resdict2, e_ind) 971 | 972 | e_ind = resdict2['mods'] == 0 973 | resdict2 = filter_results(resdict2, e_ind) 974 | 975 | 976 | if args['mc'] > 0: 977 | e_ind = resdict2['mc'] == 0 978 | resdict2 
= filter_results(resdict2, e_ind) 979 | 980 | p1 = set(resdict2['seqs']) 981 | 982 | 983 | # Calculate basic protein scores including homologues 984 | prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=False, p=False) 985 | # Calculate basic protein scores excluding homologues 986 | prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=prots_spc, p=p) 987 | 988 | 989 | 990 | 991 | 992 | filtered_prots = aux.filter(prots_spc.items(), fdr=0.05, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1, 993 | full_output=True) 994 | 995 | identified_proteins = 0 996 | 997 | for x in filtered_prots: 998 | identified_proteins += 1 999 | logger.info('Stage 1 search: identified proteins = %d', identified_proteins) 1000 | if identified_proteins <= 25: 1001 | logger.info('Low number of identified proteins, using first 25 top scored proteins for calibration...') 1002 | filtered_prots = sorted(prots_spc.items(), key=lambda x: -x[1])[:25] 1003 | 1004 | 1005 | 1006 | logger.info('Running RT prediction...') 1007 | 1008 | 1009 | e_ind = np.array([Isotopes[iorig] for iorig in resdict['iorig']]) >= 1 1010 | resdict2 = filter_results(resdict, e_ind) 1011 | 1012 | e_ind = resdict2['mods'] == 0 1013 | resdict2 = filter_results(resdict2, e_ind) 1014 | 1015 | if args['mc'] > 0: 1016 | e_ind = resdict2['mc'] == 0 1017 | resdict2 = filter_results(resdict2, e_ind) 1018 | 1019 | 1020 | true_seqs = [] 1021 | true_rt = [] 1022 | true_isotopes = [] 1023 | true_prots = set(x[0] for x in filtered_prots) 1024 | for pep, proteins in pept_prot.items(): 1025 | if any(protein in true_prots for protein in proteins): 1026 | true_seqs.append(pep) 1027 | e_ind = np.in1d(resdict2['seqs'], true_seqs) 1028 | 1029 | 1030 | true_seqs = resdict2['seqs'][e_ind] 1031 | 1032 | true_rt.extend(np.array([rts[iorig] for iorig in resdict2['iorig']])[e_ind]) 1033 | true_rt = np.array(true_rt) 1034 | true_isotopes.extend(np.array([Isotopes[iorig] for iorig in resdict2['iorig']])[e_ind]) 1035 | true_isotopes = np.array(true_isotopes) 1036 | 1037 | e_all = abs(resdict2['md'][e_ind] - mass_shift) / (mass_sigma) 1038 | zs_all_tmp = e_all ** 2 1039 | 1040 | zs_all_tmp += (true_isotopes.max() - true_isotopes) * 100 1041 | 1042 | e_ind = np.argsort(zs_all_tmp) 1043 | true_seqs = true_seqs[e_ind] 1044 | true_rt = true_rt[e_ind] 1045 | 1046 | true_seqs = true_seqs[:2500] 1047 | true_rt = true_rt[:2500] 1048 | 1049 | per_ind = np.random.RandomState(seed=SEED).permutation(len(true_seqs)) 1050 | true_seqs = true_seqs[per_ind] 1051 | true_rt = true_rt[per_ind] 1052 | 1053 | best_seq = defaultdict(list) 1054 | newseqs = [] 1055 | newRTs = [] 1056 | for seq, RT in zip(true_seqs, true_rt): 1057 | best_seq[seq].append(RT) 1058 | for k, v in best_seq.items(): 1059 | newseqs.append(k) 1060 | newRTs.append(np.median(v)) 1061 | true_seqs = np.array(newseqs) 1062 | true_rt = np.array(newRTs) 1063 | 1064 | if calib_path: 1065 | df1 = pd.read_csv(calib_path, sep='\t') 1066 | true_seqs = df1['peptide'].values 1067 | true_rt = df1['RT exp'].values 1068 | 1069 | ll = len(true_seqs) 1070 | true_seqs2 = true_seqs[int(ll/2):] 1071 | true_rt2 = true_rt[int(ll/2):] 1072 | true_seqs = true_seqs[:int(ll/2)] 1073 | true_rt = true_rt[:int(ll/2)] 1074 | 1075 | else: 1076 | 1077 | ll = len(true_seqs) 1078 | 1079 | true_seqs2 = true_seqs[int(ll/2):] 1080 | true_rt2 = true_rt[int(ll/2):] 1081 | true_seqs = true_seqs[:int(ll/2)] 1082 | true_rt = true_rt[:int(ll/2)] 1083 | 1084 | ns = true_seqs 
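# The confident peptides are split in half: ns/nr (assigned here and just
# below) are the held-out half used to measure prediction error, while
# ns2/nr2 train the additive retention model via achrom.get_RCs_vary_lcp
# further down; the RT-error histogram is then fit with the noisy-Gaussian
# model from calibrate_RT_gaus_full. A toy sketch of that fit on synthetic
# errors (not project data):
#
#     import numpy as np
#     rng = np.random.default_rng(0)
#     fake_rt_diff = rng.normal(loc=0.2, scale=0.5, size=2000)  # minutes
#     shift, sigma, cov = calibrate_RT_gaus_full(fake_rt_diff)
#     # expect shift ~ 0.2 and sigma ~ 0.5 if the fit converges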
1085 | nr = true_rt 1086 | ns2 = true_seqs2 1087 | nr2 = true_rt2 1088 | 1089 | 1090 | logger.info('First-stage peptides used for RT prediction: %d', len(true_seqs)) 1091 | 1092 | # if args['ts'] != 2 and deeplc_path: 1093 | 1094 | # dlc = DeepLC(verbose=False, batch_num=args['deeplc_batch_num'], path_model=path_model, write_library=write_library, use_library=path_to_lib, pygam_calibration=False) 1095 | 1096 | 1097 | # df_for_calib = pd.DataFrame({ 1098 | # 'seq': ns2, 1099 | # 'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in ns2], 1100 | # 'tr': nr2, 1101 | # }) 1102 | 1103 | # df_for_check = pd.DataFrame({ 1104 | # 'seq': ns, 1105 | # 'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in ns], 1106 | # 'tr': nr, 1107 | # }) 1108 | 1109 | # try: 1110 | # dlc.calibrate_preds(seq_df=df_for_calib, check_df=df_for_check) 1111 | # except: 1112 | # dlc.calibrate_preds(seq_df=df_for_calib) 1113 | 1114 | # df_for_check['pr'] = dlc.make_preds(seq_df=df_for_check) 1115 | # df_for_calib['pr'] = dlc.make_preds(seq_df=df_for_calib) 1116 | # nr2_pred = df_for_calib['pr'] 1117 | 1118 | # rt_diff_tmp = df_for_check['pr'] - df_for_check['tr'] 1119 | 1120 | # XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus_full(rt_diff_tmp) 1121 | 1122 | # else: 1123 | RC = achrom.get_RCs_vary_lcp(ns2, nr2, metric='mae') 1124 | nr2_pred = np.array([achrom.calculate_RT(s, RC) for s in ns2]) 1125 | nr_pred = np.array([achrom.calculate_RT(s, RC) for s in ns]) 1126 | 1127 | rt_diff_tmp = nr_pred - nr 1128 | 1129 | XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus_full(rt_diff_tmp) 1130 | 1131 | 1132 | logger.info('First-stage calibrated RT shift: %.3f min', XRT_shift) 1133 | logger.info('First-stage calibrated RT sigma: %.3f min', XRT_sigma) 1134 | 1135 | RT_sigma = XRT_sigma 1136 | 1137 | else: 1138 | logger.info('No matches found') 1139 | 1140 | 1141 | 1142 | 1143 | 1144 | if args['ts']: 1145 | 1146 | 1147 | if args['es']: 1148 | logger.info('Use extra Stage 2 search...') 1149 | 1150 | qin = list(set(resdict['seqs'])) 1151 | qout = [] 1152 | pepdict = worker_RT(qin, qout, 0, 1, RC, False, False, True) 1153 | 1154 | rt_pred = np.array([pepdict[s] for s in resdict['seqs']]) 1155 | rt_diff = np.array([rts[iorig] for iorig in resdict['iorig']]) - rt_pred - XRT_shift 1156 | e_all = (rt_diff) ** 2 / (RT_sigma ** 2) 1157 | r = 9.0 1158 | e_ind = e_all <= r 1159 | resdict = filter_results(resdict, e_ind) 1160 | 1161 | 1162 | 1163 | 1164 | e_ind = np.array([Isotopes[iorig] for iorig in resdict['iorig']]) >= min_isotopes_calibration 1165 | resdict2 = filter_results(resdict, e_ind) 1166 | 1167 | 1168 | e_ind = np.array([Scans[iorig] for iorig in resdict2['iorig']]) >= min_scans_calibration 1169 | resdict2 = filter_results(resdict2, e_ind) 1170 | 1171 | e_ind = resdict2['mods'] == 0 1172 | resdict2 = filter_results(resdict2, e_ind) 1173 | 1174 | 1175 | if args['mc'] > 0: 1176 | e_ind = resdict2['mc'] == 0 1177 | resdict2 = filter_results(resdict2, e_ind) 1178 | 1179 | p1 = set(resdict2['seqs']) 1180 | 1181 | 1182 | # Calculate basic protein scores including homologues 1183 | prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=False, p=False) 1184 | # Calculate basic protein scores excluding homologues 1185 | prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=prots_spc, p=p) 1186 | 1187 | 1188 | 1189 | 1190 | 1191 | filtered_prots = aux.filter(prots_spc.items(), fdr=0.05, key=escore, is_decoy=isdecoy, 
remove_decoy=True, formula=1, 1192 | full_output=True) 1193 | 1194 | identified_proteins = 0 1195 | 1196 | for x in filtered_prots: 1197 | identified_proteins += 1 1198 | logger.info('Stage 2 search: identified proteins = %d', identified_proteins) 1199 | if identified_proteins <= 25: 1200 | logger.info('Low number of identified proteins, using first 25 top scored proteins for calibration...') 1201 | filtered_prots = sorted(prots_spc.items(), key=lambda x: -x[1])[:25] 1202 | 1203 | 1204 | 1205 | 1206 | 1207 | 1208 | 1209 | e_ind = np.array([Isotopes[iorig] for iorig in resdict['iorig']]) >= 1 1210 | resdict2 = filter_results(resdict, e_ind) 1211 | 1212 | e_ind = resdict2['mods'] == 0 1213 | resdict2 = filter_results(resdict2, e_ind) 1214 | 1215 | if args['mc'] > 0: 1216 | e_ind = resdict2['mc'] == 0 1217 | resdict2 = filter_results(resdict2, e_ind) 1218 | 1219 | 1220 | true_seqs = [] 1221 | true_rt = [] 1222 | true_isotopes = [] 1223 | true_prots = set(x[0] for x in filtered_prots) 1224 | for pep, proteins in pept_prot.items(): 1225 | if any(protein in true_prots for protein in proteins): 1226 | true_seqs.append(pep) 1227 | e_ind = np.in1d(resdict2['seqs'], true_seqs) 1228 | 1229 | 1230 | true_seqs = resdict2['seqs'][e_ind] 1231 | 1232 | true_rt.extend(np.array([rts[iorig] for iorig in resdict2['iorig']])[e_ind]) 1233 | true_rt = np.array(true_rt) 1234 | true_isotopes.extend(np.array([Isotopes[iorig] for iorig in resdict2['iorig']])[e_ind]) 1235 | true_isotopes = np.array(true_isotopes) 1236 | 1237 | e_all = abs(resdict2['md'][e_ind] - mass_shift) / (mass_sigma) 1238 | zs_all_tmp = e_all ** 2 1239 | 1240 | zs_all_tmp += (true_isotopes.max() - true_isotopes) * 100 1241 | 1242 | e_ind = np.argsort(zs_all_tmp) 1243 | true_seqs = true_seqs[e_ind] 1244 | true_rt = true_rt[e_ind] 1245 | 1246 | idx_limit = len(true_seqs) 1247 | cnt_pep = len(set(true_seqs)) 1248 | while cnt_pep > 2500: 1249 | idx_limit -= 1 1250 | cnt_pep = len(set(true_seqs[:idx_limit])) 1251 | true_seqs = true_seqs[:idx_limit] 1252 | true_rt = true_rt[:idx_limit] 1253 | 1254 | per_ind = np.random.RandomState(seed=SEED).permutation(len(true_seqs)) 1255 | true_seqs = true_seqs[per_ind] 1256 | true_rt = true_rt[per_ind] 1257 | 1258 | best_seq = defaultdict(list) 1259 | newseqs = [] 1260 | newRTs = [] 1261 | for seq, RT in zip(true_seqs, true_rt): 1262 | best_seq[seq].append(RT) 1263 | for k, v in best_seq.items(): 1264 | newseqs.append(k) 1265 | newRTs.append(np.median(v)) 1266 | true_seqs = np.array(newseqs) 1267 | true_rt = np.array(newRTs) 1268 | 1269 | if calib_path: 1270 | df1 = pd.read_csv(calib_path, sep='\t') 1271 | true_seqs = df1['peptide'].values 1272 | true_rt = df1['RT exp'].values 1273 | 1274 | ll = len(true_seqs) 1275 | true_seqs2 = true_seqs[int(ll/2):] 1276 | true_rt2 = true_rt[int(ll/2):] 1277 | true_seqs = true_seqs[:int(ll/2)] 1278 | true_rt = true_rt[:int(ll/2)] 1279 | 1280 | else: 1281 | 1282 | ll = len(true_seqs) 1283 | 1284 | true_seqs2 = true_seqs[int(ll/2):] 1285 | true_rt2 = true_rt[int(ll/2):] 1286 | true_seqs = true_seqs[:int(ll/2)] 1287 | true_rt = true_rt[:int(ll/2)] 1288 | 1289 | ns = true_seqs 1290 | nr = true_rt 1291 | ns2 = true_seqs2 1292 | nr2 = true_rt2 1293 | 1294 | 1295 | 1296 | 1297 | else: 1298 | ns = np.array(ns) 1299 | nr = np.array(nr) 1300 | idx = np.abs((rt_diff_tmp) - XRT_shift) <= 3 * XRT_sigma 1301 | ns = ns[idx] 1302 | nr = nr[idx] 1303 | 1304 | rt_diff_tmp2 = nr2_pred - nr2 1305 | ns2 = np.array(ns2) 1306 | nr2 = np.array(nr2) 1307 | idx = np.abs((rt_diff_tmp2) - XRT_shift) <= 3 * 
XRT_sigma 1308 | ns2 = ns2[idx] 1309 | nr2 = nr2[idx] 1310 | 1311 | logger.info('Second-stage peptides used for RT prediction: %d', len(ns)) 1312 | 1313 | if deeplc_path: 1314 | 1315 | dlc = DeepLC(verbose=False, batch_num=args['deeplc_batch_num'], path_model=path_model, write_library=write_library, use_library=path_to_lib, pygam_calibration=False) 1316 | 1317 | 1318 | df_for_calib = pd.DataFrame({ 1319 | 'seq': ns2, 1320 | 'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in ns2], 1321 | 'tr': nr2, 1322 | }) 1323 | 1324 | df_for_check = pd.DataFrame({ 1325 | 'seq': ns, 1326 | 'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in ns], 1327 | 'tr': nr, 1328 | }) 1329 | 1330 | 1331 | try: 1332 | dlc.calibrate_preds(seq_df=df_for_calib, check_df=df_for_check) 1333 | except: 1334 | dlc.calibrate_preds(seq_df=df_for_calib) 1335 | 1336 | df_for_check['pr'] = dlc.make_preds(seq_df=df_for_check) 1337 | 1338 | rt_diff_tmp = df_for_check['pr'] - df_for_check['tr'] 1339 | 1340 | XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus_full(rt_diff_tmp) 1341 | 1342 | else: 1343 | 1344 | RC = achrom.get_RCs_vary_lcp(ns, nr, metric='mae') 1345 | RT_pred = np.array([achrom.calculate_RT(s, RC) for s in ns]) 1346 | 1347 | rt_diff_tmp = RT_pred - nr 1348 | 1349 | XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus_full(rt_diff_tmp) 1350 | 1351 | RT_sigma = XRT_sigma 1352 | 1353 | logger.info('Second-stage calibrated RT shift: %.3f min', XRT_shift) 1354 | logger.info('Second-stage calibrated RT sigma: %.3f min', XRT_sigma) 1355 | 1356 | out_log.write('Calibrated RT shift: %.3f min\n' % (XRT_shift, )) 1357 | out_log.write('Calibrated RT sigma: %.3f min\n' % (XRT_sigma, )) 1358 | 1359 | p1 = set(resdict['seqs']) 1360 | 1361 | n = args['nproc'] 1362 | 1363 | 1364 | def divide_chunks(l, n): 1365 | for i in range(0, len(l), n): 1366 | yield l[i:i + n] 1367 | 1368 | if deeplc_path: 1369 | 1370 | pepdict = dict() 1371 | 1372 | if args['save_calib']: 1373 | with open(base_out_name + '_calib.tsv', 'w') as output: 1374 | output.write('peptide\tRT exp\n') 1375 | for seq, RT in zip(ns, nr): 1376 | output.write('%s\t%s\n' % (seq, str(RT))) 1377 | for seq, RT in zip(ns2, nr2): 1378 | output.write('%s\t%s\n' % (seq, str(RT))) 1379 | 1380 | seqs_batch = list(p1) 1381 | 1382 | df_for_check = pd.DataFrame({ 1383 | 'seq': seqs_batch, 1384 | 'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in seqs_batch], 1385 | }) 1386 | 1387 | df_for_check['pr'] = dlc.make_preds(seq_df=df_for_check) 1388 | 1389 | pepdict_batch = df_for_check.set_index('seq')['pr'].to_dict() 1390 | 1391 | pepdict.update(pepdict_batch) 1392 | 1393 | 1394 | else: 1395 | 1396 | qin = list(p1) 1397 | qout = [] 1398 | pepdict = worker_RT(qin, qout, 0, 1, RC, False, False, True) 1399 | 1400 | rt_pred = np.array([pepdict[s] for s in resdict['seqs']]) 1401 | # rt_diff = np.array([rts[iorig] for iorig in resdict['iorig']]) - rt_pred 1402 | rt_diff = np.array([rts[iorig] for iorig in resdict['iorig']]) - rt_pred - XRT_shift 1403 | # rt_diff = resdict['rt'] - rt_pred 1404 | e_all = (rt_diff) ** 2 / (RT_sigma ** 2) 1405 | r = 9.0 1406 | e_ind = e_all <= r 1407 | 1408 | resdict = filter_results(resdict, e_ind) 1409 | rt_diff = rt_diff[e_ind] 1410 | rt_pred = rt_pred[e_ind] 1411 | 1412 | 1413 | logger.info('RT prediction was finished') 1414 | 1415 | 1416 | with open(base_out_name + '_protsN.tsv', 'w') as output: 1417 | output.write('dbname\ttheor peptides\n') 1418 | for k, v in protsN.items(): 1419 | output.write('\t'.join((k, str(v))) + 
'\n') 1420 | 1421 | with open(base_out_name + '_PFMs.tsv', 'w') as output: 1422 | output.write('sequence\tmass diff\tRT diff\tpeak_id\tIntensity\tIntensitySum\tnScans\tnIsotopes\tproteins\tm/z\tRT\taveragineCorr\tcharge\tion_mobility\n') 1423 | for seq, md, rtd, iorig in zip(resdict['seqs'], resdict['md'], rt_diff, resdict['iorig']): 1424 | peak_id = ids[iorig] 1425 | I = Is[iorig] 1426 | Isum = Isums[iorig] 1427 | nScans = Scans[iorig] 1428 | nIsotopes = Isotopes[iorig] 1429 | mzr = mzraw[iorig] 1430 | rtr = rts[iorig] 1431 | av = avraw[iorig] 1432 | ch = charges[iorig] 1433 | im = imraw[iorig] 1434 | output.write('\t'.join((seq, str(md), str(rtd), str(peak_id), str(I), str(Isum), str(nScans), str(nIsotopes), ';'.join(pept_prot[seq]), str(mzr), str(rtr), str(av), str(ch), str(im))) + '\n') 1435 | 1436 | mass_diff = (resdict['md'] - mass_shift) / (mass_sigma) 1437 | 1438 | rt_diff = (np.array([rts[iorig] for iorig in resdict['iorig']]) - rt_pred - XRT_shift) / RT_sigma 1439 | 1440 | # NOTE: the decoy prefix and scoring helpers are re-hardcoded here, overriding any -prefix value passed in args 1441 | prefix = 'DECOY_' 1442 | isdecoy = lambda x: x[0].startswith(prefix) 1443 | isdecoy_key = lambda x: x.startswith(prefix) 1444 | escore = lambda x: -x[1] 1445 | 1446 | param_grid = { 1447 | 'boosting_type': ['gbdt', ], 1448 | 'num_leaves': list(range(10, 1000)), 1449 | 'learning_rate': list(np.logspace(np.log10(0.001), np.log10(0.3), base = 10, num = 1000)), 1450 | 'min_child_samples': list(range(1, 1000, 5)), 1451 | 'reg_alpha': list(np.linspace(0, 1)), 1452 | 'reg_lambda': list(np.linspace(0, 1)), 1453 | 'colsample_bytree': list(np.linspace(0.01, 1, 100)), 1454 | 'subsample': list(np.linspace(0.01, 1, 100)), 1455 | 'is_unbalance': [True, False], 1456 | 'metric': ['rmse', ], 1457 | 'verbose': [-1, ], 1458 | 'num_threads': [args['nproc'], ], 1459 | # 'use_quantized_grad': [True, ] 1460 | } 1461 | 1462 | def get_X_array(df, feature_columns): 1463 | return df.loc[:, feature_columns].values 1464 | 1465 | def get_Y_array_pfms(df): 1466 | return df.loc[:, 'decoy'].values 1467 | 1468 | def get_features_pfms(dataframe): 1469 | feature_columns = dataframe.columns 1470 | columns_to_remove = [] 1471 | banned_features = { 1472 | 'iorig', 1473 | 'ids', 1474 | 'seqs', 1475 | 'decoy', 1476 | 'preds', 1477 | 'av', 1478 | 'Is', 1479 | # 'Scans', 1480 | 'proteins', 1481 | 'peptide', 1482 | 'md', 1483 | 'qpreds', 1484 | 'decoy1', 1485 | 'decoy2', 1486 | 'top_25_targets', 1487 | 'G', 1488 | } 1489 | 1490 | for feature in feature_columns: 1491 | if feature in banned_features: 1492 | columns_to_remove.append(feature) 1493 | feature_columns = feature_columns.drop(columns_to_remove) 1494 | return feature_columns 1495 | 1496 | def objective_pfms(df, hyperparameters, iteration, threshold=0): 1497 | """Objective function for grid and random search. 
Returns 1498 | the cross validation score from a set of hyperparameters.""" 1499 | 1500 | all_res = [] 1501 | 1502 | for group_val in range(3): 1503 | 1504 | mask = df['G'] == group_val 1505 | test_df = df[mask] 1506 | test_ids = set(test_df['ids']) 1507 | train_df = df[(~mask) & (df['ids'].apply(lambda x: x not in test_ids))] 1508 | 1509 | 1510 | 1511 | feature_columns = get_features_pfms(df) 1512 | model = get_cat_model_final_pfms(train_df[~train_df['decoy2']], hyperparameters, feature_columns) 1513 | 1514 | df.loc[mask, 'preds'] = model.predict(get_X_array(df.loc[mask, :], feature_columns)) 1515 | 1516 | test_df = df[mask] 1517 | 1518 | fpr, tpr, thresholds = metrics.roc_curve(get_Y_array_pfms(test_df[~test_df['decoy2']]), test_df[~test_df['decoy2']]['preds']) 1519 | shr_v = metrics.auc(fpr, tpr) 1520 | 1521 | all_res.append(shr_v) 1522 | 1523 | if shr_v < threshold: 1524 | all_res = [0, ] 1525 | break 1526 | 1527 | shr_v = np.mean(all_res) 1528 | 1529 | return np.array([shr_v, hyperparameters, iteration, all_res], dtype=object) 1530 | 1531 | def random_search_pfms(df, param_grid, out_file, max_evals): 1532 | """Random search for hyperparameter optimization. 1533 | Writes result of search to csv file every search iteration.""" 1534 | 1535 | threshold = 0 1536 | 1537 | 1538 | 1539 | # Dataframe for results 1540 | results = pd.DataFrame(columns = ['sharpe', 'params', 'iteration', 'all_res'], 1541 | index = list(range(max_evals))) 1542 | for i in range(max_evals): 1543 | 1544 | # Choose random hyperparameters 1545 | random_params = {k: np.random.RandomState().choice(v, 1)[0] for k, v in param_grid.items()} 1546 | 1547 | # Evaluate randomly selected hyperparameters 1548 | eval_results = objective_pfms(df, random_params, i, threshold) 1549 | results.loc[i, :] = eval_results 1550 | 1551 | threshold = max(threshold, np.mean(eval_results[3]) - 3 * np.std(eval_results[3])) 1552 | 1553 | # open connection (append option) and write results 1554 | of_connection = open(out_file, 'a') 1555 | writer = csv.writer(of_connection) 1556 | writer.writerow(eval_results) 1557 | of_connection.close() 1558 | 1559 | # Sort with best score on top 1560 | results.sort_values('sharpe', ascending = False, inplace = True) 1561 | results.reset_index(inplace = True) 1562 | 1563 | return results 1564 | 1565 | 1566 | 1567 | def get_cat_model_pfms(df, hyperparameters, feature_columns, train, test): 1568 | feature_columns = list(feature_columns) 1569 | dtrain = lgb.Dataset(get_X_array(train, feature_columns), get_Y_array_pfms(train), feature_name=feature_columns, free_raw_data=False) 1570 | dvalid = lgb.Dataset(get_X_array(test, feature_columns), get_Y_array_pfms(test), feature_name=feature_columns, free_raw_data=False) 1571 | np.random.seed(SEED) 1572 | evals_result = {} 1573 | model = lgb.train(hyperparameters, dtrain, num_boost_round=500, valid_sets=(dvalid,), valid_names=('valid',), verbose_eval=False, 1574 | early_stopping_rounds=10, evals_result=evals_result) 1575 | return model 1576 | 1577 | def get_cat_model_final_pfms(df, hyperparameters, feature_columns): 1578 | feature_columns = list(feature_columns) 1579 | train = df 1580 | dtrain = lgb.Dataset(get_X_array(train, feature_columns), get_Y_array_pfms(train), feature_name=feature_columns, free_raw_data=False) 1581 | np.random.seed(SEED) 1582 | model = lgb.train(hyperparameters, dtrain, num_boost_round=100) 1583 | 1584 | return model 1585 | 1586 | 1587 | logger.info('Prepare ML features') 1588 | 1589 | df1 = pd.DataFrame() 1590 | for k in resdict.keys(): 1591 | 
df1[k] = resdict[k] 1592 | 1593 | df1['ids'] = ids[df1['iorig'].values] 1594 | df1['Is'] = Is[df1['iorig'].values] 1595 | df1['Scans'] = Scans[df1['iorig'].values] 1596 | df1['Isotopes'] = Isotopes[df1['iorig'].values] 1597 | df1['mzraw'] = mzraw[df1['iorig'].values] 1598 | df1['rt'] = rts[df1['iorig'].values] 1599 | df1['av'] = avraw[df1['iorig'].values] 1600 | df1['ch'] = charges[df1['iorig'].values] 1601 | df1['im'] = imraw[df1['iorig'].values] 1602 | 1603 | df1['mass_diff'] = mass_diff 1604 | df1['rt_diff'] = rt_diff 1605 | df1['decoy'] = df1['seqs'].apply(lambda x: all(z.startswith(prefix) for z in pept_prot[x])) 1606 | 1607 | df1['peptide'] = df1['seqs'] 1608 | # mass_dict = {} 1609 | # pI_dict = {} 1610 | # charge_dict = {} 1611 | # for pep in set(df1['peptide']): 1612 | # try: 1613 | # mass_dict[pep] = mass.fast_mass2(pep) 1614 | # pI_dict[pep] = electrochem.pI(pep) 1615 | # charge_dict[pep] = electrochem.charge(pep, pH=7.0) 1616 | # except: 1617 | # mass_dict[pep] = 0 1618 | # pI_dict[pep] = 0 1619 | # charge_dict[pep] = 0 1620 | 1621 | df1['plen'] = df1['peptide'].apply(lambda z: len(z)) 1622 | 1623 | 1624 | # df1['mass'] = df1['peptide'].apply(lambda x: mass_dict[x]) 1625 | # df1['pI'] = df1['peptide'].apply(lambda x: pI_dict[x]) 1626 | # df1['charge_theor'] = df1['peptide'].apply(lambda x: charge_dict[x]) 1627 | 1628 | for aa in mass.std_aa_mass: 1629 | df1['c_%s' % (aa, )] = df1['peptide'].apply(lambda x: x.count(aa)) 1630 | df1['c_DP'] = df1['peptide'].apply(lambda x: x.count('DP')) 1631 | df1['c_KP'] = df1['peptide'].apply(lambda x: x.count('KP')) 1632 | df1['c_RP'] = df1['peptide'].apply(lambda x: x.count('RP')) 1633 | 1634 | df1['rt_diff_abs'] = df1['rt_diff'].abs() 1635 | df1['rt_diff_abs_pdiff'] = df1['rt_diff_abs'] - df1.groupby('ids')['rt_diff_abs'].transform('median') 1636 | df1['rt_diff_abs_pnorm'] = df1['rt_diff_abs'] / (df1.groupby('ids')['rt_diff_abs'].transform('sum') + 1e-2) 1637 | 1638 | df1['mass_diff_abs'] = df1['mass_diff'].abs() 1639 | df1['mass_diff_abs_pdiff'] = df1['mass_diff_abs'] - df1.groupby('ids')['mass_diff_abs'].transform('median') 1640 | df1['mass_diff_abs_pnorm'] = df1['mass_diff_abs'] / (df1.groupby('ids')['mass_diff_abs'].transform('sum') + 1e-2) 1641 | 1642 | df1['id_count'] = df1.groupby('ids')['mass_diff'].transform('count') 1643 | 1644 | #limited CSD information 1645 | if args['csd'] == 1: 1646 | logging.info('Using limited CSD') 1647 | df1['seq_count'] = df1.groupby('peptide')['mass_diff'].transform('count') 1648 | df1['charge_count'] = df1.groupby('peptide')['ch'].transform('nunique') 1649 | df1['im_count'] = df1.groupby('peptide')['im'].transform('nunique') 1650 | 1651 | #complete CSD information 1652 | elif args['csd'] == 2: 1653 | logging.info('Using complete CSD') 1654 | #imports and functions: should be global? 
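# The complete-CSD branch below extracts experimental charge-state distributions (per-charge XIC areas for z = 1..4 via ThermoRawFileParser), predicts theoretical distributions with a Keras model, and compares the two.
# The agreement feature computed further down is a normalized spectral angle, z_angle = 1 - (2/pi) * arccos(cos_sim(csd, csd_pred)), which equals 1 for identical distributions and decreases toward 0 as they diverge.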
1655 | import json 1656 | from keras import models 1657 | 1658 | os.environ["CUDA_VISIBLE_DEVICES"] = "" 1659 | 1660 | #constants 1661 | PROTON = 1.00727646677 1662 | TINY = 1e-6 1663 | 1664 | psi_to_single_ptm = {'[Acetyl]-': 'B', 1665 | '[Carbamidomethyl]': '', 1666 | 'M[Oxidation]': 'J', 1667 | 'S[Phosphorylation]': 'X', 1668 | 'T[Phosphorylation]': 'O', 1669 | 'Y[Phosphorylation]': 'U'} 1670 | #functions 1671 | def reshapeOneHot(X): 1672 | X = np.dstack(X) 1673 | X = np.swapaxes(X, 1, 2) 1674 | X = np.swapaxes(X, 0, 1) 1675 | return X 1676 | 1677 | def get_single_ptm_code(psi_sequence): 1678 | sequence = psi_sequence 1679 | for ptm in psi_to_single_ptm: 1680 | sequence = sequence.replace(ptm, psi_to_single_ptm[ptm]) 1681 | return sequence 1682 | 1683 | def one_hot_encode_peptide(psi_sequence, MAX_LENGTH = 40): 1684 | peptide = get_single_ptm_code(psi_sequence) 1685 | if len(peptide) > MAX_LENGTH: 1686 | logger.warning('Peptide length is larger than the maximal length of %d', MAX_LENGTH) 1687 | return ['', None] 1688 | else: 1689 | AA_vocabulary = 'KRPTNAQVSGILCMJHFYWEDBXOU'#B: acetyl; J: oxidized Met; O: PhosphoT; X: PhosphoS; U: PhosphoY 1690 | 1691 | one_hot_peptide = np.zeros((len(peptide), len(AA_vocabulary))) 1692 | 1693 | for j in range(0, len(peptide)): 1694 | try: 1695 | aa = peptide[j] 1696 | one_hot_peptide[j, AA_vocabulary.index(aa)] = 1 1697 | except ValueError: 1698 | logger.warning('"%s" is not in the vocabulary; it will be skipped', aa) 1699 | 1700 | no_front_paddings = int((MAX_LENGTH - len(peptide))/2) 1701 | peptide_front_paddings = np.zeros((no_front_paddings, one_hot_peptide.shape[1])) 1702 | 1703 | no_back_paddings = MAX_LENGTH - len(peptide) - no_front_paddings 1704 | peptide_back_paddings = np.zeros((no_back_paddings, one_hot_peptide.shape[1])) 1705 | 1706 | full_one_hot_peptide = np.vstack((peptide_front_paddings, one_hot_peptide, peptide_back_paddings)) 1707 | 1708 | return peptide, full_one_hot_peptide 1709 | 1710 | def expand_charges(row, min_charge, max_charge, mode=1): 1711 | charge = np.arange(min_charge, max_charge + 1).reshape(-1, 1) 1712 | mz = row['massCalib']/charge + PROTON * mode 1713 | index = [f'{int(row["id"])}_{z[0]}' for z in charge] 1714 | result = pd.DataFrame(mz, columns=['mz'], index=index) 1715 | result['z'] = charge 1716 | result['rt_start'] = row['rtStart'] 1717 | result['rt_end'] = row['rtEnd'] 1718 | if 'FAIMS' in row.keys(): 1719 | result['scan_filter'] = f'ms cv={row["FAIMS"]}' 1720 | 1721 | return result 1722 | 1723 | def AUC(x, y): 1724 | return (0.5 * (y[1:] + y[:-1]) * (x[1:] - x[:-1])).sum() 1725 | 1726 | def aggregate_csd(g): 1727 | result = {k: v for k,v in zip(g['z'], g['area'])} 1728 | result['pepid'] = g['pepid'].iloc[0] 1729 | return result 1730 | 1731 | #imports and functions: end 1732 | 1733 | logger.info('Running charge-state distribution prediction') 1734 | 1735 | ## extracting feature CSDs 1736 | 1737 | faims_mode = np.any(df_features['FAIMS'] != 0) 1738 | logger.debug(f'FAIMS mode: {faims_mode}') 1739 | 1740 | #find all features that have been mapped to a sequence 1741 | if faims_mode: 1742 | used_features = df_features.loc[df1['ids'].unique(), ['massCalib', 'rtStart', 'rtEnd', 'FAIMS']].copy() 1743 | else: 1744 | used_features = df_features.loc[df1['ids'].unique(), ['massCalib', 'rtStart', 'rtEnd']].copy() 1745 | used_features['id'] = used_features.index.astype(int) 1746 | 1747 | logger.debug(f'{used_features.shape[0]} features have been mapped to peptide sequences') 1748 | 1749 | #expand each feature to 1 - 4 charge states 1750 | mz_table = 
pd.concat(used_features.apply(expand_charges, args=(1, 4), axis=1).tolist()) 1751 | # TODO: a better way to filter impossible m/z values is needed 1752 | mzmin = np.floor(mzraw.min()) 1753 | mzmax = np.ceil(mzraw.max()) 1754 | mz_table = mz_table.query('mz >= @mzmin & mz <= @mzmax').copy() 1755 | 1756 | logger.debug(f'{mz_table.shape[0]} XICs scheduled for extraction') 1757 | 1758 | #writing JSON input for TRFP 1759 | mz_table['tolerance'] = 5 1760 | mz_table['tolerance_unit'] = 'ppm' 1761 | mz_table['comment'] = mz_table.index 1762 | trfp_xic_in = os.path.join(tempfile.gettempdir(), os.urandom(24).hex()) 1763 | with open(trfp_xic_in, 'w') as out: 1764 | json.dump(mz_table.drop(['z'], axis=1).apply(lambda row: {k:v for k,v in zip(row.index, row)}, axis=1).tolist(), out) 1765 | 1766 | #running TRFP to extract XICs (needs RAW file) 1767 | trfp_xic_out = os.path.join(tempfile.gettempdir(), os.urandom(24).hex()) 1768 | 1769 | trfp_params = [args['trfp'], 'xic', '-i', f'{base_out_name[:-9]}.raw', '-b', trfp_xic_out, '-j', trfp_xic_in] 1770 | 1771 | if os.name != 'nt' and os.path.splitext(args['trfp'])[1].lower() == '.exe': 1772 | trfp_params.insert(0, 'mono') 1773 | 1774 | logger.info('Running XIC extraction') 1775 | 1776 | try: 1777 | subprocess.check_output(trfp_params) 1778 | os.remove(trfp_xic_in) 1779 | except subprocess.CalledProcessError as ex: 1780 | logger.error(f'TRFP execution failed\nOutput:\n{ex.output}') 1781 | raise Exception('TRFP XIC extraction failed') 1782 | 1783 | #read TRFP output 1784 | with open(trfp_xic_out, 'r') as xic_input: 1785 | chromatograms = json.load(xic_input) 1786 | 1787 | os.remove(trfp_xic_out) 1788 | 1789 | #calculate AUCs 1790 | mz_table['area'] = [AUC(np.array(cr['RetentionTimes']), np.array(cr['Intensities'])) for cr in chromatograms['Content']] 1791 | mz_table['pepid'] = mz_table['comment'].str.split('_', expand=True).iloc[:, 0].astype(int) 1792 | 1793 | #building CSD table 1794 | csd_table = pd.DataFrame(mz_table.groupby('pepid').apply(aggregate_csd).tolist()) 1795 | csd_table.rename(columns={k:v for k,v in zip(csd_table.columns, csd_table.columns.astype(str))}, inplace=True) 1796 | csd_table = csd_table.reindex(columns=sorted(csd_table.columns)) 1797 | 1798 | #add CSD information 1799 | used_features = used_features.join(csd_table.set_index('pepid', verify_integrity=True), how='left') 1800 | 1801 | #map feature CSDs onto the PFM table 1802 | pfm_csd = used_features.reindex(df1['ids'])[['1', '2', '3', '4']].values 1803 | 1804 | ## predict CSDs for all sequences using the model 1805 | 1806 | #loading model 1807 | CSD_model = models.load_model( 1808 | os.path.normpath(os.path.join(os.path.dirname(__file__), 'models', 'CSD_model_LCMSMS.hdf5')), 1809 | compile=False) 1810 | CSD_model.compile() 1811 | 1812 | #using only unique sequences for prediction 1813 | unique_sequences = pd.DataFrame(df1.drop_duplicates('seqs')['seqs']) 1814 | CSD_X = reshapeOneHot(unique_sequences['seqs'].apply(lambda s: one_hot_encode_peptide(s)[1]).values) 1815 | 1816 | logger.debug(f'{unique_sequences.shape[0]} unique peptide sequences detected') 1817 | logger.info('Running charge-state distribution prediction model') 1818 | 1819 | #prediction 1820 | CSD_pred = CSD_model.predict(CSD_X, batch_size=2048) 1821 | unique_sequences[['1', '2', '3', '4']] = CSD_pred[:, :4] 1822 | pfm_csd_pred = unique_sequences.set_index('seqs').reindex(df1['seqs']).values 1823 | 1824 | #masking impossible m/z values in prediction 1825 | mask = np.isnan(pfm_csd) 1826 | pfm_csd[mask] = 0 1827 | pfm_csd_pred[mask] = 0 1828 
| 1829 | #normalizing CSDs 1830 | pfm_csd = pfm_csd / (pfm_csd.sum(axis=1) + TINY).reshape(-1, 1) 1831 | pfm_csd_pred = pfm_csd_pred / (pfm_csd_pred.sum(axis=1) + TINY).reshape(-1, 1) 1832 | 1833 | #spectrum angle feature 1834 | df1['z_angle'] = 1 - 2 * np.arccos((pfm_csd * pfm_csd_pred).sum(axis=1) /\ 1835 | (np.linalg.norm(pfm_csd, 2, axis=1) * np.linalg.norm(pfm_csd_pred, 2, axis=1) + TINY)) / np.pi 1836 | 1837 | #adding average_charge 1838 | pfm_csd = np.concatenate([pfm_csd, (pfm_csd * np.arange(1, 5).reshape(1, 4)).sum(axis=1).reshape(-1, 1)], axis=1) 1839 | pfm_csd_pred = np.concatenate([pfm_csd_pred, (pfm_csd_pred * np.arange(1, 5).reshape(1, 4)).sum(axis=1).reshape(-1, 1)], axis=1) 1840 | 1841 | #prediction error features for individual intensities and average charge 1842 | z_deltas = pfm_csd - pfm_csd_pred 1843 | df1[[f'z{z}_err' for z in '1234a']] = (z_deltas - z_deltas.mean(axis=0)) / z_deltas.std(axis=0).reshape(1, -1) 1844 | 1845 | logger.info('Charge-state distribution features added') 1846 | 1847 | p1 = set(resdict['seqs']) 1848 | 1849 | prots_spc2 = defaultdict(set) 1850 | for pep, proteins in pept_prot.items(): 1851 | if pep in p1: 1852 | for protein in proteins: 1853 | prots_spc2[protein].add(pep) 1854 | 1855 | for k in protsN: 1856 | if k not in prots_spc2: 1857 | prots_spc2[k] = set([]) 1858 | prots_spc = dict((k, len(v)) for k, v in prots_spc2.items()) 1859 | 1860 | names_arr = np.array(list(prots_spc.keys())) 1861 | v_arr = np.array(list(prots_spc.values())) 1862 | n_arr = np.array([protsN[k] for k in prots_spc]) 1863 | 1864 | top100decoy_score = [prots_spc.get(dprot, 0) for dprot in protsN if isdecoy_key(dprot)] 1865 | top100decoy_N = [val for key, val in protsN.items() if isdecoy_key(key)] 1866 | p = np.mean(top100decoy_score) / np.mean(top100decoy_N) 1867 | logger.info('Stage 3 search: probability of random match for theoretical peptide = %.3f', p) 1868 | 1869 | prots_spc = dict() 1870 | all_pvals = utils.calc_sf_all(v_arr, n_arr, p) 1871 | for idx, k in enumerate(names_arr): 1872 | prots_spc[k] = all_pvals[idx] 1873 | 1874 | target_prots_25_fdr = set([x[0] for x in aux.filter(prots_spc.items(), fdr=0.25, key=escore, is_decoy=isdecoy, remove_decoy=False, formula=1, full_output=True, correction=0)]) 1875 | df1['proteins'] = df1['seqs'].apply(lambda x: ';'.join(pept_prot[x])) 1876 | df1['decoy2'] = df1['decoy'] 1877 | df1['decoy'] = df1['proteins'].apply(lambda x: all(z not in target_prots_25_fdr for z in x.split(';'))) 1878 | df1['top_25_targets'] = df1['decoy'] 1879 | 1880 | 1881 | if len(target_prots_25_fdr) <= 25: 1882 | logger.info('Low number of identified proteins, turning off LightGBM...') 1883 | filtered_prots = sorted(prots_spc.items(), key=lambda x: -x[1])[:25] 1884 | skip_ml = 1 1885 | else: 1886 | skip_ml = 0 1887 | 1888 | if args['ml'] and not skip_ml: 1889 | 1890 | logger.info('Start Machine Learning on PFMs...') 1891 | 1892 | MAX_EVALS = 25 1893 | 1894 | out_file = os.path.join(tempfile.gettempdir(), os.urandom(24).hex()) 1895 | of_connection = open(out_file, 'w') 1896 | writer = csv.writer(of_connection) 1897 | 1898 | headers = ['auc', 'params', 'iteration', 'all_res'] 1899 | writer.writerow(headers) 1900 | of_connection.close() 1901 | 1902 | all_id_list = list(set(df1[df1['decoy']]['peptide'])) 1903 | np.random.RandomState(seed=SEED).shuffle(all_id_list) 1904 | seq_gmap = {} 1905 | for idx, split in enumerate(np.array_split(all_id_list, 3)): 1906 | for id_ftr in split: 1907 | seq_gmap[id_ftr] = idx 1908 | 1909 | all_id_list = 
list(set(df1[~df1['decoy']]['peptide'])) 1910 | np.random.RandomState(seed=SEED).shuffle(all_id_list) 1911 | for idx, split in enumerate(np.array_split(all_id_list, 3)): 1912 | for id_ftr in split: 1913 | seq_gmap[id_ftr] = idx 1914 | 1915 | 1916 | 1917 | df1['G'] = df1['peptide'].apply(lambda x: seq_gmap[x]) 1918 | 1919 | # train = df1[~df1['decoy']] 1920 | # train_extra = df1[df1['decoy']] 1921 | # train_extra = train_extra.sample(frac=min(1.0, len(train)/len(train_extra))) 1922 | # train = pd.concat([train, train_extra], ignore_index=True).reset_index(drop=True) 1923 | # random_results = random_search_pfms(train, param_grid, out_file, MAX_EVALS) 1924 | 1925 | random_results = random_search_pfms(df1, param_grid, out_file, MAX_EVALS) 1926 | 1927 | random_results = pd.read_csv(out_file) 1928 | random_results = random_results[random_results['auc'] != 'auc'] 1929 | random_results['params'] = random_results['params'].apply(lambda x: ast.literal_eval(x)) 1930 | convert_dict = {'auc': float, 1931 | } 1932 | random_results = random_results.astype(convert_dict) 1933 | 1934 | 1935 | bestparams = random_results.sort_values(by='auc',ascending=False)['params'].values[0] 1936 | 1937 | bestparams['num_threads'] = args['nproc'] 1938 | 1939 | 1940 | 1941 | for group_val in range(3): 1942 | 1943 | mask = df1['G'] == group_val 1944 | test_df = df1[mask] 1945 | test_ids = set(test_df['ids']) 1946 | train_df = df1[(~mask) & (df1['ids'].apply(lambda x: x not in test_ids))] 1947 | 1948 | # mask2 = train['G'] != group_val 1949 | 1950 | # train_df = train[(mask2) & (train['ids'].apply(lambda x: x not in test_ids))] 1951 | 1952 | 1953 | feature_columns = list(get_features_pfms(train_df)) 1954 | model = get_cat_model_final_pfms(train_df[~train_df['decoy2']], bestparams, feature_columns) 1955 | 1956 | df1.loc[mask, 'preds'] = rankdata(model.predict(get_X_array(test_df, feature_columns)), method='ordinal') / sum(mask) 1957 | 1958 | 1959 | else: 1960 | df1['preds'] = np.power(df1['mass_diff'], 2) + np.power(df1['rt_diff'], 2) 1961 | 1962 | 1963 | 1964 | 1965 | df1['qpreds'] = pd.qcut(df1['preds'], 50, labels=range(50)) 1966 | 1967 | df1['decoy'] = df1['decoy2'] 1968 | 1969 | 1970 | df1u = df1.sort_values(by='preds') 1971 | df1u = df1u.drop_duplicates(subset='seqs') 1972 | 1973 | qval_ok = 0 1974 | for qval_cur in range(50): 1975 | df1ut = df1u[df1u['qpreds'] == qval_cur] 1976 | decoy_ratio = df1ut['decoy'].sum() / len(df1ut) 1977 | if decoy_ratio < ml_correction: 1978 | qval_ok = qval_cur 1979 | else: 1980 | break 1981 | logger.info('%d %% of PFMs were removed from protein scoring after Machine Learning', (100 - (qval_ok+1)*2)) 1982 | 1983 | 1984 | df1u = df1u[df1u['qpreds'] <= qval_ok]#.copy() 1985 | 1986 | df1u['qpreds'] = pd.qcut(df1u['preds'], 10, labels=range(10)) 1987 | 1988 | qdict = df1u.set_index('seqs').to_dict()['qpreds'] 1989 | 1990 | df1['qpreds'] = df1['seqs'].apply(lambda x: qdict.get(x, 11)) 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | df1.to_csv(base_out_name + '_PFMs_ML.tsv', sep='\t', index=False) 2005 | 2006 | df1 = df1[df1['qpreds'] <= 10] 2007 | 2008 | resdict = {} 2009 | resdict['seqs'] = 
df1['seqs'].values 2010 | resdict['qpreds'] = df1['qpreds'].values 2011 | resdict['ids'] = df1['ids'].values 2012 | 2013 | mass_diff = resdict['qpreds'] 2014 | rt_diff = [] 2015 | if skip_ml: 2016 | mass_diff = np.zeros(len(mass_diff)) 2017 | 2018 | p1 = set(resdict['seqs']) 2019 | 2020 | prots_spc2 = defaultdict(set) 2021 | for pep, proteins in pept_prot.items(): 2022 | if pep in p1: 2023 | for protein in proteins: 2024 | prots_spc2[protein].add(pep) 2025 | 2026 | for k in protsN: 2027 | if k not in prots_spc2: 2028 | prots_spc2[k] = set([]) 2029 | prots_spc = dict((k, len(v)) for k, v in prots_spc2.items()) 2030 | 2031 | names_arr = np.array(list(prots_spc.keys())) 2032 | v_arr = np.array(list(prots_spc.values())) 2033 | n_arr = np.array([protsN[k] for k in prots_spc]) 2034 | 2035 | top100decoy_score = [prots_spc.get(dprot, 0) for dprot in protsN if isdecoy_key(dprot)] 2036 | top100decoy_N = [val for key, val in protsN.items() if isdecoy_key(key)] 2037 | p = np.mean(top100decoy_score) / np.mean(top100decoy_N) 2038 | logger.info('Final stage search: probability of random match for theoretical peptide = %.3f', p) 2039 | 2040 | 2041 | prots_spc = dict() 2042 | all_pvals = utils.calc_sf_all(v_arr, n_arr, p) 2043 | for idx, k in enumerate(names_arr): 2044 | prots_spc[k] = all_pvals[idx] 2045 | 2046 | sf = args['separate_figures'] 2047 | 2048 | top_proteins = final_iteration(resdict, mass_diff, rt_diff, pept_prot, protsN, base_out_name, prefix, isdecoy, isdecoy_key, escore, fdr, args['nproc'], out_log, fname, separate_figures=sf) 2049 | 2050 | # pept_prot_limited = dict() 2051 | # for k, v in pept_prot.items(): 2052 | # tmp = set(zz for zz in v if zz in top_proteins) 2053 | # if len(tmp): 2054 | # pept_prot_limited[k] = tmp 2055 | 2056 | # resdict3 = get_resdict(pept_prot_limited, acc_l=25, acc_r=25, aa_mass=kwargs['aa_mass']) 2057 | 2058 | # p1 = set(resdict3['seqs']) 2059 | 2060 | # if deeplc_path: 2061 | 2062 | # pepdict = dict() 2063 | 2064 | # seqs_batch = list(p1) 2065 | 2066 | # df_for_check = pd.DataFrame({ 2067 | # 'seq': seqs_batch, 2068 | # 'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in seqs_batch], 2069 | # }) 2070 | 2071 | # df_for_check['pr'] = dlc.make_preds(seq_df=df_for_check) 2072 | 2073 | # pepdict_batch = df_for_check.set_index('seq')['pr'].to_dict() 2074 | 2075 | # pepdict.update(pepdict_batch) 2076 | 2077 | 2078 | # else: 2079 | 2080 | # qin = list(p1) 2081 | # qout = [] 2082 | # pepdict = worker_RT(qin, qout, 0, 1, RC, False, False, True) 2083 | 2084 | # rt_pred = np.array([pepdict[s] for s in resdict3['seqs']]) 2085 | # rt_diff = np.array([rts[iorig] for iorig in resdict3['iorig']]) - rt_pred - XRT_shift 2086 | # # e_all = (rt_diff) ** 2 / (RT_sigma ** 2) 2087 | # # r = 9.0 2088 | # # e_ind = e_all <= r 2089 | # # resdict = filter_results(resdict, e_ind) 2090 | # # rt_diff = rt_diff[e_ind] 2091 | # # rt_pred = rt_pred[e_ind] 2092 | 2093 | 2094 | 2095 | # with open(base_out_name + '_PFMs_extended.tsv', 'w') as output: 2096 | # output.write('sequence\tmass diff\tRT diff\tpeak_id\tIntensity\tIntensitySum\tnScans\tnIsotopes\tproteins\tm/z\tRT\taveragineCorr\tcharge\tion_mobility\n') 2097 | # for seq, md, rtd, iorig in zip(resdict3['seqs'], resdict3['md'], rt_diff, resdict3['iorig']): 2098 | # peak_id = ids[iorig] 2099 | # I = Is[iorig] 2100 | # Isum = Isums[iorig] 2101 | # nScans = Scans[iorig] 2102 | # nIsotopes = Isotopes[iorig] 2103 | # mzr = mzraw[iorig] 2104 | # rtr = rts[iorig] 2105 | # av = avraw[iorig] 2106 | # ch = charges[iorig] 2107 | # im = 
imraw[iorig] 2108 | # output.write('\t'.join((seq, str(md), str(rtd), str(peak_id), str(I), str(Isum), str(nScans), str(nIsotopes), ';'.join(pept_prot[seq]), str(mzr), str(rtr), str(av), str(ch), str(im))) + '\n') 2109 | 2110 | 2111 | # print('!!!', len(set(resdict3['seqs']))) 2112 | 2113 | logger.info('The search for file %s is finished.', base_out_name) 2114 | 2115 | def worker(qin, qout, mass_diff, rt_diff, resdict, protsN, pept_prot, isdecoy_key, isdecoy, fdr, prots_spc_basic2, win_sys=False): 2116 | 2117 | for item in (iter(qin.get, None) if not win_sys else qin): 2118 | mass_koef, rtt_koef = item 2119 | e_ind = mass_diff <= mass_koef 2120 | resdict2 = filter_results(resdict, e_ind) 2121 | 2122 | features_dict = dict() 2123 | for pep in set(resdict2['seqs']): 2124 | for bprot in pept_prot[pep]: 2125 | prot_score = prots_spc_basic2[bprot] 2126 | if prot_score > features_dict.get(pep, [-1, ])[-1]: 2127 | features_dict[pep] = (bprot, prot_score) 2128 | 2129 | prots_spc_basic = dict() 2130 | 2131 | p1 = set(resdict2['seqs']) 2132 | 2133 | pep_pid = defaultdict(set) 2134 | pid_pep = defaultdict(set) 2135 | banned_dict = dict() 2136 | for pep, pid in zip(resdict2['seqs'], resdict2['ids']): 2137 | pep_pid[pep].add(pid) 2138 | pid_pep[pid].add(pep) 2139 | if pep in banned_dict: 2140 | banned_dict[pep] += 1 2141 | else: 2142 | banned_dict[pep] = 1 2143 | 2144 | banned_pids_total = set() 2145 | 2146 | if len(p1): 2147 | prots_spc_final = dict() 2148 | prots_spc_copy = False 2149 | prots_spc2 = False 2150 | unstable_prots = set() 2151 | p0 = False 2152 | 2153 | prev_best_score = 1e6 2154 | 2155 | names_arr = False 2156 | tmp_spc_new = False 2157 | decoy_set = False 2158 | 2159 | cnt_target_final = 0 2160 | cnt_decoy_final = 0 2161 | 2162 | while 1: 2163 | if not prots_spc2: 2164 | 2165 | best_match_dict = dict() 2166 | n_map_dict = defaultdict(list) 2167 | for k, v in protsN.items(): 2168 | n_map_dict[v].append(k) 2169 | 2170 | decoy_set = set() 2171 | for k in protsN: 2172 | if isdecoy_key(k): 2173 | decoy_set.add(k) 2174 | # decoy_set = list(decoy_set) 2175 | 2176 | 2177 | prots_spc2 = defaultdict(set) 2178 | for pep, proteins in pept_prot.items(): 2179 | if pep in p1: 2180 | for protein in proteins: 2181 | if protein == features_dict[pep][0]: 2182 | prots_spc2[protein].add(pep) 2183 | 2184 | for k in protsN: 2185 | if k not in prots_spc2: 2186 | prots_spc2[k] = set([]) 2187 | prots_spc2 = dict(prots_spc2) 2188 | unstable_prots = set(prots_spc2.keys()) 2189 | 2190 | top100decoy_N = sum([val for key, val in protsN.items() if isdecoy_key(key)]) 2191 | 2192 | names_arr = np.array(list(prots_spc2.keys())) 2193 | n_arr = np.array([protsN[k] for k in names_arr]) 2194 | 2195 | tmp_spc_new = dict((k, len(v)) for k, v in prots_spc2.items()) 2196 | 2197 | 2198 | top100decoy_score_tmp = {dprot: tmp_spc_new.get(dprot, 0) for dprot in decoy_set} 2199 | top100decoy_score_tmp_sum = float(sum(top100decoy_score_tmp.values())) 2200 | 2201 | prots_spc_basic = False 2202 | 2203 | prots_spc = tmp_spc_new 2204 | if not prots_spc_copy: 2205 | prots_spc_copy = deepcopy(prots_spc) 2206 | 2207 | # for idx, v in enumerate(decoy_set): 2208 | # if v in unstable_prots: 2209 | # top100decoy_score_tmp_sum -= top100decoy_score_tmp[idx] 2210 | # top100decoy_score_tmp[idx] = prots_spc.get(v, 0) 2211 | # top100decoy_score_tmp_sum += top100decoy_score_tmp[idx] 2212 | 2213 | 2214 | for v in decoy_set.intersection(unstable_prots): 2215 | top100decoy_score_tmp_sum -= top100decoy_score_tmp[v] 2216 | top100decoy_score_tmp[v] = 
prots_spc.get(v, 0) 2217 | top100decoy_score_tmp_sum += top100decoy_score_tmp[v] 2218 | 2219 | 2220 | # p = float(sum(top100decoy_score_tmp)) / top100decoy_N 2221 | p = top100decoy_score_tmp_sum / top100decoy_N 2222 | if not p0: 2223 | p0 = float(p) 2224 | 2225 | n_change = set(protsN[k] for k in unstable_prots) 2226 | if len(best_match_dict) == 0: 2227 | n_change = sorted(n_change) 2228 | for n_val in n_change: 2229 | for k in n_map_dict[n_val]: 2230 | v = prots_spc[k] 2231 | if n_val not in best_match_dict or v > prots_spc[best_match_dict[n_val]]: 2232 | best_match_dict[n_val] = k 2233 | n_arr_small = [] 2234 | names_arr_small = [] 2235 | v_arr_small = [] 2236 | max_k = 0 2237 | 2238 | for k, v in best_match_dict.items(): 2239 | num_matched = prots_spc[v] 2240 | if num_matched >= max_k: 2241 | 2242 | max_k = num_matched 2243 | 2244 | n_arr_small.append(k) 2245 | names_arr_small.append(v) 2246 | v_arr_small.append(prots_spc[v]) 2247 | 2248 | prots_spc_basic = dict() 2249 | all_pvals = utils.calc_sf_all(np.array(v_arr_small), n_arr_small, p) 2250 | 2251 | for idx, k in enumerate(names_arr_small): 2252 | prots_spc_basic[k] = all_pvals[idx] 2253 | 2254 | best_prot = utils.keywithmaxval(prots_spc_basic) 2255 | 2256 | best_score = min(prots_spc_basic[best_prot], prev_best_score) 2257 | prev_best_score = best_score 2258 | 2259 | unstable_prots = set() 2260 | if best_prot not in prots_spc_final: 2261 | prots_spc_final[best_prot] = best_score 2262 | banned_pids = set() 2263 | for pep in prots_spc2[best_prot]: 2264 | for pid in pep_pid[pep]: 2265 | if pid not in banned_pids_total: 2266 | banned_pids.add(pid) 2267 | for pid in banned_pids:#banned_pids.difference(banned_pids_total): 2268 | for pep in pid_pep[pid]: 2269 | banned_dict[pep] -= 1 2270 | if banned_dict[pep] == 0: 2271 | best_prot_val = features_dict[pep][0] 2272 | tmp_spc_new[best_prot_val] -= 1 2273 | unstable_prots.add(best_prot_val) 2274 | # for bprot in pept_prot[pep]: 2275 | # if bprot == best_prot_val: 2276 | # tmp_spc_new[bprot] -= 1 2277 | # unstable_prots.add(bprot) 2278 | 2279 | banned_pids_total.update(banned_pids) 2280 | 2281 | if best_prot in decoy_set: 2282 | cnt_decoy_final += 1 2283 | else: 2284 | cnt_target_final += 1 2285 | 2286 | else: 2287 | 2288 | v_arr = np.array([prots_spc[k] for k in names_arr]) 2289 | all_pvals = utils.calc_sf_all(v_arr, n_arr, p) 2290 | for idx, k in enumerate(names_arr): 2291 | prots_spc_basic[k] = all_pvals[idx] 2292 | 2293 | for k, v in prots_spc_basic.items(): 2294 | if k not in prots_spc_final: 2295 | prots_spc_final[k] = v 2296 | 2297 | break 2298 | 2299 | prots_spc_basic[best_prot] = 0 2300 | 2301 | try: 2302 | prot_fdr = cnt_decoy_final / cnt_target_final 2303 | # prot_fdr = aux.fdr(prots_spc_final.items(), is_decoy=isdecoy) 2304 | except: 2305 | prot_fdr = 100.0 2306 | if prot_fdr >= 12.5 * fdr: 2307 | 2308 | v_arr = np.array([prots_spc[k] for k in names_arr]) 2309 | all_pvals = utils.calc_sf_all(v_arr, n_arr, p) 2310 | for idx, k in enumerate(names_arr): 2311 | prots_spc_basic[k] = all_pvals[idx] 2312 | 2313 | for k, v in prots_spc_basic.items(): 2314 | if k not in prots_spc_final: 2315 | prots_spc_final[k] = v 2316 | break 2317 | 2318 | if mass_koef == 9: 2319 | item2 = prots_spc_copy 2320 | else: 2321 | item2 = False 2322 | if not win_sys: 2323 | qout.put((prots_spc_final, item2)) 2324 | else: 2325 | qout.append((prots_spc_final, item2)) 2326 | if not win_sys: 2327 | qout.put(None) 2328 | else: 2329 | return qout 2330 | 
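# Illustrative sketch (hypothetical helper, not called anywhere in this module):
# the protein scores used throughout this file come from utils.calc_sf_all, which
# takes (matched peptides v, theoretical peptides n, decoy-estimated random-match
# probability p). Assuming a binomial null model (utils.py imports scipy.stats.binom),
# a survival-function score of that signature could be computed like this:
def _binomial_protein_score_sketch(v_arr, n_arr, p):
    from scipy.stats import binom
    # P(X >= v) for X ~ Binomial(n, p), as -log10 so larger means more significant
    sf = binom.sf(np.asarray(v_arr) - 1, np.asarray(n_arr), p)
    return -np.log10(np.clip(sf, 1e-300, None))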
-------------------------------------------------------------------------------- /ms1searchpy/models/CSD_model_LCMSMS.hdf5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/markmipt/ms1searchpy/ab7dd1aba1a513f71d028263972fd5e4c79e7090/ms1searchpy/models/CSD_model_LCMSMS.hdf5 -------------------------------------------------------------------------------- /ms1searchpy/ms1todiffacto.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import argparse 3 | import pandas as pd 4 | import subprocess 5 | import logging 6 | 7 | def run(): 8 | parser = argparse.ArgumentParser( 9 | description='run diffacto for ms1searchpy results', 10 | epilog=''' 11 | 12 | Example usage 13 | ------------- 14 | $ ms1todiffacto -S1 sample1_1_proteins.tsv sample1_n_proteins.tsv -S2 sample2_1_proteins.tsv sample2_n_proteins.tsv 15 | ------------- 16 | ''', 17 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 18 | 19 | parser.add_argument('-dif', help='path to Diffacto', required=True) 20 | parser.add_argument('-S1', nargs='+', help='input files for S1 sample', required=True) 21 | parser.add_argument('-S2', nargs='+', help='input files for S2 sample', required=True) 22 | parser.add_argument('-S3', nargs='+', help='input files for S3 sample') 23 | parser.add_argument('-S4', nargs='+', help='input files for S4 sample') 24 | parser.add_argument('-S5', nargs='+', help='input files for S5 sample') 25 | parser.add_argument('-S6', nargs='+', help='input files for S6 sample') 26 | parser.add_argument('-S7', nargs='+', help='input files for S7 sample') 27 | parser.add_argument('-S8', nargs='+', help='input files for S8 sample') 28 | parser.add_argument('-S9', nargs='+', help='input files for S9 sample') 29 | parser.add_argument('-S10', nargs='+', help='input files for S10 sample') 30 | parser.add_argument('-S11', nargs='+', help='input files for S11 sample') 31 | parser.add_argument('-S12', nargs='+', help='input files for S12 sample') 32 | parser.add_argument('-peptides', help='name of output peptides file', default='peptides.txt') 33 | parser.add_argument('-samples', help='name of output samples file', default='sample.txt') 34 | parser.add_argument('-allowed_prots', help='path to allowed prots', default='') 35 | parser.add_argument('-out', help='name of diffacto output file', default='diffacto_out.txt') 36 | parser.add_argument('-norm', help='normalization method. 
Can be average, median, GMM or None', default='None') 37 | parser.add_argument('-impute_threshold', help='impute_threshold for missing values fraction', default='0.75') 38 | parser.add_argument('-min_samples', help='minimum number of samples for peptide usage', default='3') 39 | parser.add_argument('-debug', help='Produce debugging output', action='store_true') 40 | args = vars(parser.parse_args()) 41 | logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s', 42 | datefmt='[%H:%M:%S]', level=[logging.INFO, logging.DEBUG][args['debug']]) 43 | logger = logging.getLogger(__name__) 44 | 45 | replace_label = '_proteins.tsv' 46 | 47 | df_final = False 48 | 49 | allowed_prots = set() 50 | allowed_peptides = set() 51 | allowed_prots_all = set() 52 | 53 | all_labels = [] 54 | 55 | if not args['allowed_prots']: 56 | 57 | for i in range(1, 13, 1): 58 | sample_num = 'S%d' % (i, ) 59 | if args[sample_num]: 60 | for z in args[sample_num]: 61 | df0 = pd.read_table(z) 62 | allowed_prots.update(df0['dbname']) 63 | else: 64 | for prot in open(args['allowed_prots'], 'r'): 65 | allowed_prots.add(prot.strip()) 66 | 67 | 68 | for i in range(1, 13, 1): 69 | sample_num = 'S%d' % (i, ) 70 | if args[sample_num]: 71 | for z in args[sample_num]: 72 | df0 = pd.read_table(z.replace('_proteins.tsv', '_PFMs_ML.tsv')) 73 | df0 = df0[df0['qpreds'] <= 10] 74 | allowed_peptides.update(df0['seqs']) 75 | 76 | 77 | if not args['allowed_prots']: 78 | for i in range(1, 13, 1): 79 | sample_num = 'S%d' % (i, ) 80 | if args[sample_num]: 81 | for z in args[sample_num]: 82 | df3 = pd.read_table(z.replace('_proteins.tsv', '_PFMs.tsv')) 83 | df3 = df3[df3['sequence'].apply(lambda x: x in allowed_peptides)] 84 | 85 | df3_tmp = df3[df3['proteins'].apply(lambda x: any(z in allowed_prots for z in x.split(';')))] 86 | for dbnames in set(df3_tmp['proteins'].values): 87 | for dbname in dbnames.split(';'): 88 | allowed_prots_all.add(dbname) 89 | else: 90 | allowed_prots_all = allowed_prots 91 | 92 | 93 | for i in range(1, 13, 1): 94 | sample_num = 'S%d' % (i, ) 95 | if args[sample_num]: 96 | for z in args[sample_num]: 97 | label = z.replace(replace_label, '') 98 | all_labels.append(label) 99 | df3 = pd.read_table(z.replace(replace_label, '_PFMs.tsv')) 100 | logger.debug(z) 101 | logger.debug(z.replace(replace_label, '_PFMs.tsv')) 102 | logger.debug(df3.shape) 103 | logger.debug(df3.columns) 104 | 105 | df3 = df3[df3['proteins'].apply(lambda x: any(z in allowed_prots_all for z in x.split(';')))] 106 | df3['proteins'] = df3['proteins'].apply(lambda x: ';'.join([z for z in x.split(';') if z in allowed_prots_all])) 107 | 108 | df3['origseq'] = df3['sequence'] 109 | df3['sequence'] = df3['sequence'] + df3['charge'].astype(int).astype(str) + df3['ion_mobility'].astype(str) 110 | 111 | df3 = df3.sort_values(by='Intensity', ascending=False) 112 | df3 = df3.drop_duplicates(subset='sequence') 113 | # df3 = df3.explode('proteins') 114 | 115 | df3[label] = df3['Intensity'] 116 | df3['protein'] = df3['proteins'] 117 | df3['peptide'] = df3['sequence'] 118 | df3 = df3[['origseq', 'peptide', 'protein', label]] 119 | 120 | 121 | 122 | if df_final is False: 123 | df_final = df3.reset_index(drop=True) 124 | else: 125 | df_final = df_final.reset_index(drop=True).merge(df3.reset_index(drop=True), on='peptide', how='outer') 126 | df_final.protein_x.fillna(value=df_final.protein_y, inplace=True) 127 | df_final.origseq_x.fillna(value=df_final.origseq_y, inplace=True) 128 | df_final['protein'] = df_final['protein_x'] 129 | df_final['origseq'] = 
df_final['origseq_x'] 130 | 131 | df_final = df_final.drop(columns=['protein_x', 'protein_y']) 132 | df_final = df_final.drop(columns=['origseq_x', 'origseq_y']) 133 | 134 | 135 | df_final['intensity_median'] = df_final[all_labels].median(axis=1) 136 | df_final['nummissing'] = df_final[all_labels].isna().sum(axis=1) 137 | logger.debug(df_final['nummissing']) 138 | df_final = df_final.sort_values(by=['nummissing', 'intensity_median'], ascending=(True, False)) 139 | df_final = df_final.drop_duplicates(subset=('origseq', 'protein')) 140 | 141 | logger.debug(df_final.columns) 142 | df_final = df_final.set_index('peptide') 143 | df_final['proteins'] = df_final['protein'] 144 | df_final = df_final.drop(columns=['protein']) 145 | cols = df_final.columns.tolist() 146 | cols.remove('proteins') 147 | cols.insert(0, 'proteins') 148 | df_final = df_final[cols] 149 | df_final = df_final.fillna(value='') 150 | df_final.to_csv(args['peptides'], sep=',') 151 | 152 | out = open(args['samples'], 'w') 153 | for i in range(1, 13, 1): 154 | sample_num = 'S%d' % (i, ) 155 | if args[sample_num]: 156 | for z in args[sample_num]: 157 | label = z.replace(replace_label, '') 158 | out.write(label + '\t' + sample_num + '\n') 159 | out.close() 160 | 161 | subprocess.call([args['dif'], '-i', args['peptides'], '-samples', args['samples'], '-out',\ 162 | args['out'], '-normalize', args['norm'], '-impute_threshold', args['impute_threshold'], '-min_samples', args['min_samples']]) 163 | 164 | 165 | 166 | if __name__ == '__main__': 167 | run() 168 | -------------------------------------------------------------------------------- /ms1searchpy/search.py: -------------------------------------------------------------------------------- 1 | from . import main 2 | import argparse 3 | import logging 4 | import os 5 | 6 | def run(): 7 | parser = argparse.ArgumentParser( 8 | description='Search proteins using LC-MS spectra', 9 | epilog=''' 10 | 11 | Example usage 12 | ------------- 13 | $ search.py input.mzML input2.mzML -d human.fasta -ad 1 -fdr 5.0 14 | ------------- 15 | ''', 16 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 17 | 18 | parser.add_argument('files', help='input mzML or .tsv files with peptide features', nargs='+') 19 | parser.add_argument('-d', '-db', help='path to protein fasta file', required=True) 20 | parser.add_argument('-o', help='path to output folder', default='') 21 | parser.add_argument('-ptol', help='precursor mass tolerance in ppm', default=10.0, type=float) 22 | parser.add_argument('-fdr', help='protein fdr filter in %%', default=1.0, type=float) 23 | parser.add_argument('-i', help='minimum number of isotopes', default=2, type=int) 24 | parser.add_argument('-ci', help='minimum number of isotopes for mass and RT calibration', default=4, type=int) 25 | parser.add_argument('-csc', help='minimum number of scans for mass and RT calibration', default=4, type=int) 26 | parser.add_argument('-ts', help='Two-stage RT training: 0 - turn off, 1 - turn on, 2 - turn on and use additive model in the first stage (Default)', default=2, type=int) 27 | parser.add_argument('-sc', help='minimum number of scans for peptide feature', default=2, type=int) 28 | parser.add_argument('-lmin', help='min length of peptides', default=7, type=int) 29 | parser.add_argument('-lmax', help='max length of peptides', default=30, type=int) 30 | parser.add_argument('-e', help='cleavage rule in quotes. 
X!Tandem style for cleavage rules: "[RK]|{P}" for trypsin,\ 31 | "[X]|[D]" for asp-n or "[RK]|{P},[K]|[X]" for a mix of trypsin and lys-c', default='[RK]|{P}') 32 | parser.add_argument('-mc', help='number of missed cleavages', default=0, type=int) 33 | parser.add_argument('-cmin', help='min precursor charge', default=1, type=int) 34 | parser.add_argument('-cmax', help='max precursor charge', default=4, type=int) 35 | parser.add_argument('-fmods', help='fixed modifications in psiname1@aminoacid1,psiname2@aminoacid2 format. Use "[" and "]" for N-term and C-term amino acids', default='Carbamidomethyl@C') 36 | parser.add_argument('-fmods_legend', help='PSI Names for extra fixed modifications, in psiname1@monomass1,psiname2@monomass2 format. Oxidation, Carbamidomethyl and TMT6plex are stored by default in the source code', default='') 37 | parser.add_argument('-ad', help='add decoy', default=0, type=int) 38 | parser.add_argument('-ml', help='use machine learning for PFMs', default=1, type=int) 39 | parser.add_argument('-prefix', help='decoy prefix', default='DECOY_') 40 | parser.add_argument('-sf', '--separate-figures', action='store_true', help='save figures as separate files') 41 | parser.add_argument('-nproc', help='number of processes', default=4, type=int) 42 | parser.add_argument('-force_nproc', help='Force using multiprocessing for Windows', action='store_true') 43 | parser.add_argument('-deeplc', help='use deeplc: 0 - turn off, 1 - turn on', default=0, type=int) 44 | parser.add_argument('-deeplc_batch_num', help='batch_num for deeplc', default=100000, type=int) 45 | parser.add_argument('-deeplc_model_path', help='path to deeplc model or folder with deeplc models', default='') 46 | parser.add_argument('-deeplc_library', help='path to deeplc library', default='') 47 | parser.add_argument('-pl', help='path to list of peptides for RT calibration', default='') 48 | parser.add_argument('-mcalib', help='mass calibration: 2 - group by ion mobility and RT, 1 - by RT, 0 - no calibration', default=0, type=int) 49 | parser.add_argument('-debug', help='Produce debugging output', action='store_true') 50 | parser.add_argument('-save_calib', help='Save RT calibration list', action='store_true') 51 | parser.add_argument('-check_unique', help='Experimental. Check feature ids for uniqueness', default=1, type=int) 52 | parser.add_argument('-es', help='Experimental. Use extra stage for RT calibration', default=0, type=int) 53 | parser.add_argument('-csd', help='Employ the limited (1) or complete (2) charge-state distribution model; for the complete model, the path to ThermoRawFileParser 1.4.2+ has to be provided. Default (0): don\'t use charge-state distribution', default=0, type=int) 54 | parser.add_argument('-trfp', help='Path to ThermoRawFileParser executable', default='') 55 | 56 | args = vars(parser.parse_args()) 57 | logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s', 58 | datefmt='[%H:%M:%S]', level=[logging.INFO, logging.DEBUG][args['debug']]) 59 | logging.getLogger('matplotlib.font_manager').disabled = True 60 | logging.getLogger('matplotlib.category').disabled = True 61 | logging.getLogger('matplotlib').setLevel(logging.WARNING) 62 | logger = logging.getLogger(__name__) 63 | 64 | 65 | if os.name == 'nt' and not args['force_nproc']: 66 | logger.warning('Turning off multiprocessing for Windows system. 
Use -force_nproc option to turn it on') 67 | args['nproc'] = 1 68 | 69 | logger.debug('Starting with args: %s', args) 70 | main.process_file(args) 71 | 72 | if __name__ == '__main__': 73 | run() 74 | -------------------------------------------------------------------------------- /ms1searchpy/utils.py: -------------------------------------------------------------------------------- 1 | from pyteomics import fasta, parser, mass 2 | import os 3 | from scipy.stats import binom 4 | import numpy as np 5 | import pandas as pd 6 | import random 7 | import itertools 8 | from biosaur2 import main as bio_main 9 | import logging 10 | from copy import deepcopy 11 | 12 | logger = logging.getLogger(__name__) 13 | 14 | # Temporary for pyteomics <= Version 4.5.5 bug 15 | if 'H-' in mass.std_aa_mass: 16 | del mass.std_aa_mass['H-'] 17 | if '-OH' in mass.std_aa_mass: 18 | del mass.std_aa_mass['-OH'] 19 | 20 | mods_custom_dict = { 21 | 'Oxidation': 15.994915, 22 | 'Carbamidomethyl': 57.021464, 23 | 'TMT6plex': 229.162932, 24 | } 25 | 26 | 27 | def get_aa_mass_with_fixed_mods(fmods, fmods_legend): 28 | 29 | if fmods_legend: 30 | for mod in fmods_legend.split(','): 31 | psiname, m = mod.split('@') 32 | mods_custom_dict[psiname] = float(m) 33 | 34 | aa_mass = deepcopy(mass.std_aa_mass) 35 | aa_to_psi = dict() 36 | 37 | mass_h2o = mass.calculate_mass('H2O') 38 | for k in list(aa_mass.keys()): 39 | aa_mass[k] = round(mass.calculate_mass(sequence=k) - mass_h2o, 7) 40 | 41 | if fmods: 42 | for mod in fmods.split(','): 43 | psiname, aa = mod.split('@') 44 | if psiname not in mods_custom_dict: 45 | logger.error('PSI Name for modification %s is missing in the modification legend' % (psiname, )) 46 | raise Exception('Exception: missing PSI Name for modification') 47 | if aa == '[': 48 | aa_mass['Nterm'] = float(mods_custom_dict[psiname])#float(m) 49 | aa_to_psi['Nterm'] = psiname 50 | elif aa == ']': 51 | aa_mass['Cterm'] = float(mods_custom_dict[psiname])#float(m) 52 | aa_to_psi['Cterm'] = psiname 53 | else: 54 | aa_mass[aa] += float(mods_custom_dict[psiname])#float(m) 55 | aa_to_psi[aa] = psiname 56 | 57 | logger.debug(aa_mass) 58 | 59 | return aa_mass, aa_to_psi 60 | 61 | 62 | def mods_for_deepLC(seq, aa_to_psi): 63 | if 'Nterm' in aa_to_psi: 64 | mods_list = ['0|%s' % (aa_to_psi['Nterm'], ), ] 65 | else: 66 | mods_list = [] 67 | mods_list.extend([str(idx+1)+'|%s' % (aa_to_psi[aa]) for idx, aa in enumerate(seq) if aa in aa_to_psi]) 68 | if 'Cterm' in aa_to_psi: 69 | mods_list.append(['-1|%s' % (aa_to_psi['Cterm'], ), ]) 70 | return '|'.join(mods_list) 71 | 72 | def recalc_spc(banned_dict, unstable_prots, prots_spc2): 73 | tmp = dict() 74 | for k in unstable_prots: 75 | tmp[k] = sum(banned_dict.get(l, 1) > 0 for l in prots_spc2[k]) 76 | return tmp 77 | 78 | def iterate_spectra(fname, min_ch, max_ch, min_isotopes, min_scans, nproc, check_unique=True): 79 | if os.path.splitext(fname)[-1].lower() == '.mzml': 80 | args = { 81 | 'file': fname, 82 | 'mini': 1, 83 | 'minmz': 350, 84 | 'maxmz': 1500, 85 | 'pasefmini': 100, 86 | 'htol': 8, 87 | 'itol': 8, 88 | 'paseftol': 0.05, 89 | 'nm': 0, 90 | 'o': '', 91 | 'hvf': 1.3, 92 | 'ivf': 5, 93 | 'minlh': 2, 94 | 'pasefminlh': 1, 95 | 'nprocs': nproc, 96 | 'cmin': 1, 97 | 'cmax': 6, 98 | 'dia': False, 99 | 'diahtol': 25, 100 | 'diaminlh': 1, 101 | 'mgf': '', 102 | 'tof': False, 103 | 'profile': False, 104 | 'write_hills': False, 105 | 'debug': False # actual debug value is set through logging, not here 106 | } 107 | bio_main.process_file(args) 108 | fname = 
71 | 
72 | def recalc_spc(banned_dict, unstable_prots, prots_spc2):
73 | tmp = dict()
74 | for k in unstable_prots:
75 | tmp[k] = sum(banned_dict.get(l, 1) > 0 for l in prots_spc2[k])
76 | return tmp
77 | 
78 | def iterate_spectra(fname, min_ch, max_ch, min_isotopes, min_scans, nproc, check_unique=True):
79 | if os.path.splitext(fname)[-1].lower() == '.mzml':
80 | args = {
81 | 'file': fname,
82 | 'mini': 1,
83 | 'minmz': 350,
84 | 'maxmz': 1500,
85 | 'pasefmini': 100,
86 | 'htol': 8,
87 | 'itol': 8,
88 | 'paseftol': 0.05,
89 | 'nm': 0,
90 | 'o': '',
91 | 'hvf': 1.3,
92 | 'ivf': 5,
93 | 'minlh': 2,
94 | 'pasefminlh': 1,
95 | 'nprocs': nproc,
96 | 'cmin': 1,
97 | 'cmax': 6,
98 | 'dia': False,
99 | 'diahtol': 25,
100 | 'diaminlh': 1,
101 | 'mgf': '',
102 | 'tof': False,
103 | 'profile': False,
104 | 'write_hills': False,
105 | 'debug': False # actual debug value is set through logging, not here
106 | }
107 | bio_main.process_file(args)
108 | fname = os.path.splitext(fname)[0] + '.features.tsv'
109 | 
110 | df_features = pd.read_csv(fname, sep='\t')
111 | 
112 | required_columns = [
113 | 'nIsotopes',
114 | 'nScans',
115 | 'charge',
116 | 'massCalib',
117 | 'rtApex',
118 | 'mz',
119 | ]
120 | 
121 | if not all(req_col in df_features.columns for req_col in required_columns):
122 | logger.error('input feature file is missing columns: %s', ';'.join([req_col for req_col in required_columns if req_col not in df_features.columns]))
123 | raise Exception('Exception: wrong columns in feature file')
124 | logger.info('Total number of peptide isotopic clusters: %d', len(df_features))
125 | 
126 | if 'id' not in df_features.columns:
127 | df_features['id'] = df_features.index
128 | if 'FAIMS' not in df_features.columns:
129 | df_features['FAIMS'] = 0
130 | if 'im' not in df_features.columns:
131 | df_features['im'] = 0
132 | 
133 | # if 'mz_std_1' in df_features.columns:
134 | # df_features['mz_diff_ppm_1'] = df_features.apply(lambda x: 1e6 * (x['mz'] - (x['mz_std_1'] - 1.00335 / x['charge'])) / x['mz'], axis=1)
135 | # df_features['mz_diff_ppm_2'] = -100
136 | # df_features.loc[df_features['intensity_2'] > 0, 'mz_diff_ppm_2'] = df_features.loc[df_features['intensity_2'] > 0, :].apply(lambda x: 1e6 * (x['mz'] - (x['mz_std_2'] - 2 * 1.00335 / x['charge'])) / x['mz'], axis=1)
137 | 
138 | # df_features['I-0-1'] = df_features.apply(lambda x: x['intensityApex'] / x['intensity_1'], axis=1)
139 | # df_features['I-0-2'] = -1
140 | # df_features.loc[df_features['intensity_2'] > 0, 'I-0-2'] = df_features.loc[df_features['intensity_2'] > 0, :].apply(lambda x: x['intensityApex'] / x['intensity_2'], axis=1)
141 | 
142 | if check_unique:
143 | # Check unique ids
144 | if len(df_features['id']) != len(set(df_features['id'])):
145 | df_features['id'] = df_features.index + 1
146 | 
147 | # Remove features with low number of isotopes
148 | df_features = df_features[df_features['nIsotopes'] >= min_isotopes]
149 | 
150 | # Remove features with low number of scans
151 | df_features = df_features[df_features['nScans'] >= min_scans]
152 | 
153 | # Remove features using min and max charges
154 | df_features = df_features[df_features['charge'].apply(lambda x: min_ch <= x <= max_ch)]
155 | 
156 | return df_features
157 | 
158 | def peptide_gen(args):
159 | 
160 | 
161 | 
162 | prefix = args['prefix']
163 | enzyme = get_enzyme(args['e'])
164 | mc = args['mc']
165 | minlen = args['lmin']
166 | maxlen = args['lmax']
167 | for prot in prot_gen(args):
168 | for pep in prot_peptides(prot[1], enzyme, mc, minlen, maxlen, is_decoy=prot[0].startswith(prefix)):
169 | yield pep
170 | 
171 | def get_enzyme(enzyme):
172 | return convert_tandem_cleave_rule_to_regexp(enzyme)
173 | # if enzyme in parser.expasy_rules:
174 | # return parser.expasy_rules.get(enzyme)
175 | # else:
176 | # try:
177 | # enzyme = convert_tandem_cleave_rule_to_regexp(enzyme)
178 | # return enzyme
179 | # except:
180 | # return enzyme
181 | 
182 | def prot_gen(args):
183 | db = args['d']
184 | 
185 | with fasta.read(db) as f:
186 | for p in f:
187 | yield p
188 | 
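# For illustration, the X!Tandem-style rules accepted by get_enzyme above are
# translated into regular expressions by convert_tandem_cleave_rule_to_regexp
# (defined further below), e.g.:
#   '[RK]|{P}' -> '([RK](?=[^P]))'   cleave after K/R not followed by P (trypsin)
#   '[X]|[D]'  -> '(?=[D])'          cleave before D (asp-n)
# (the order of residues inside a character class may vary, as it comes from a set)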
189 | def prepare_decoy_db(args):
190 | add_decoy = args['ad']
191 | if add_decoy:
192 | 
193 | prefix = args['prefix']
194 | db = args['d']
195 | out1, out2 = os.path.splitext(db)
196 | out_db = out1 + '_shuffled' + out2
197 | logger.info('Creating decoy database: %s', out_db)
198 | 
199 | extra_check = False
200 | if '{' in args['e']:
201 | extra_check = True
202 | if extra_check:
203 | banned_pairs = set()
204 | banned_aa = set()
205 | for enzyme_local in args['e'].split(','):
206 | if '{' in enzyme_local:
207 | lpart, rpart = enzyme_local.split('|')
208 | for aa_left, aa_right in itertools.product(lpart[1:-1], rpart[1:-1]):
209 | banned_aa.add(aa_left)
210 | banned_aa.add(aa_right)
211 | banned_pairs.add(aa_left+aa_right)
212 | 
213 | logger.debug(banned_aa)
214 | logger.debug(banned_pairs)
215 | 
216 | enzyme = get_enzyme(args['e'])
217 | cleave_rule_custom = enzyme + '|' + '([BXZUO])'
218 | # cleave_rule_custom = '([RKBXZUO])'
219 | logger.debug(cleave_rule_custom)
220 | 
221 | shuf_map = dict()
222 | 
223 | prots = []
224 | 
225 | for p in fasta.read(db):
226 | if not p[0].startswith(prefix):
227 | target_peptides = [x[1] for x in parser.icleave(p[1], cleave_rule_custom, 0)]
228 | 
229 | checked_peptides = set()
230 | sample_list = []
231 | for idx, pep in enumerate(target_peptides):
232 | 
233 | if len(pep) > 2:
234 | pep_tmp = pep[1:-1]
235 | if extra_check:
236 | for bp in banned_pairs:
237 | if bp in pep_tmp:
238 | pep_tmp = pep_tmp.replace(bp, '')
239 | checked_peptides.add(idx)
240 | 
241 | 
242 | sample_list.extend(pep_tmp)
243 | random.shuffle(sample_list)
244 | idx_for_shuffle = 0
245 | 
246 | decoy_peptides = []
247 | for idx, pep in enumerate(target_peptides):
248 | 
249 | if len(pep) > 2:
250 | 
251 | if pep in shuf_map:
252 | tmp_seq = shuf_map[pep]
253 | else:
254 | if not extra_check or idx not in checked_peptides:
255 | tmp_seq = pep[0]
256 | for pep_aa in pep[1:-1]:
257 | tmp_seq += sample_list[idx_for_shuffle]
258 | idx_for_shuffle += 1
259 | tmp_seq += pep[-1]
260 | else:
261 | max_l = len(pep)
262 | tmp_seq = ''
263 | ii = 0
264 | while ii < max_l - 1:
265 | # for ii in range(max_l-1):
266 | if pep[ii] in banned_aa and pep[ii+1] in banned_aa and pep[ii] + pep[ii+1] in banned_pairs:
267 | tmp_seq += pep[ii] + pep[ii+1]
268 | ii += 1
269 | else:
270 | if ii == 0:
271 | tmp_seq += pep[ii]
272 | else:
273 | tmp_seq += sample_list[idx_for_shuffle]
274 | idx_for_shuffle += 1
275 | 
276 | ii += 1
277 | tmp_seq += pep[max_l-1]
278 | 
279 | shuf_map[pep] = tmp_seq
280 | else:
281 | tmp_seq = pep
282 | 
283 | decoy_peptides.append(tmp_seq)
284 | 
285 | assert len(target_peptides) == len(decoy_peptides)
286 | 
287 | prots.append((p[0], ''.join(target_peptides)))
288 | prots.append((prefix + p[0], ''.join(decoy_peptides)))
289 | 
290 | fasta.write(prots, open(out_db, 'w')).close()
291 | args['d'] = out_db
292 | args['ad'] = 0
293 | return args
294 | 
295 | seen_target = set()
296 | seen_decoy = set()
297 | def prot_peptides(prot_seq, enzyme, mc, minlen, maxlen, is_decoy, dont_use_seen_peptides=False):
298 | 
299 | 
300 | dont_use_fast_valid = parser.fast_valid(prot_seq)
301 | peptides = parser.cleave(prot_seq, enzyme, mc)
302 | for pep in peptides:
303 | plen = len(pep)
304 | if minlen <= plen <= maxlen:
305 | forms = []
306 | if dont_use_fast_valid or pep in seen_target or pep in seen_decoy or parser.fast_valid(pep):
307 | if plen <= maxlen:
308 | forms.append(pep)
309 | for f in forms:
310 | if dont_use_seen_peptides:
311 | yield f
312 | else:
313 | if f not in seen_target and f not in seen_decoy:
314 | if is_decoy:
315 | seen_decoy.add(f)
316 | else:
317 | seen_target.add(f)
318 | yield f
319 | 
320 | def get_prot_pept_map(args):
321 | seen_target.clear()
322 | seen_decoy.clear()
323 | 
324 | 
325 | prefix = args['prefix']
326 | enzyme = get_enzyme(args['e'])
327 | mc = args['mc']
328 | minlen = args['lmin']
329 | maxlen = args['lmax']
330 | 
331 | pept_prot = dict()
332 | protsN = dict()
333 | 
334 | target_prot_count = 0
335 | decoy_prot_count = 0
336 | target_peps = set()
337 | decoy_peps = set()
338 | 
339 | 
340 | 
341 | for desc, prot in prot_gen(args):
342 | dbinfo = desc.split(' ')[0]
343 | for pep in prot_peptides(prot, enzyme, mc, minlen, maxlen, desc.startswith(prefix), dont_use_seen_peptides=True):
344 | pept_prot.setdefault(pep, set()).add(dbinfo)
345 | protsN.setdefault(dbinfo, set()).add(pep)
346 | for k, v in protsN.items():
347 | if k.startswith(prefix):
348 | decoy_prot_count += 1
349 | decoy_peps.update(v)
350 | else:
351 | target_prot_count += 1
352 | target_peps.update(v)
353 | 
354 | protsN[k] = len(v)
355 | 
356 | logger.info('Database information:')
357 | logger.info('Target/Decoy proteins: %d/%d', target_prot_count, decoy_prot_count)
358 | target_peps_number = len(target_peps)
359 | decoy_peps_number = len(decoy_peps)
360 | intersection_fraction = len(target_peps.intersection(decoy_peps)) / (target_peps_number + decoy_peps_number)
361 | logger.info('Target/Decoy peptides: %d/%d', target_peps_number, decoy_peps_number)
362 | logger.info('Target-Decoy peptide intersection: %.1f %%',
363 | 100 * intersection_fraction)
364 | 
365 | ml_correction = decoy_peps_number * (1 - intersection_fraction) / target_peps_number * 0.5
366 | del decoy_peps
367 | del target_peps
368 | return protsN, pept_prot, ml_correction
369 | 
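# A worked example of the correction above, with hypothetical numbers:
# 1,000,000 target and 950,000 decoy peptides sharing 50,000 sequences give
# intersection_fraction = 50,000 / 1,950,000 ~= 0.026, and
# ml_correction ~= 950,000 * (1 - 0.026) / 1,000,000 * 0.5 ~= 0.46.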
370 | 
371 | def convert_tandem_cleave_rule_to_regexp(cleavage_rule):
372 | 
373 | def get_sense(c_term_rule, n_term_rule):
374 | if '{' in c_term_rule:
375 | return 'N'
376 | elif '{' in n_term_rule:
377 | return 'C'
378 | else:
379 | if len(c_term_rule) <= len(n_term_rule):
380 | return 'C'
381 | else:
382 | return 'N'
383 | 
384 | def get_cut(cut, no_cut):
385 | aminoacids = set(parser.std_amino_acids)
386 | cut = ''.join(aminoacids & set(cut))
387 | if '{' in no_cut:
388 | no_cut = ''.join(aminoacids & set(no_cut))
389 | return cut, no_cut
390 | else:
391 | no_cut = ''.join(set(parser.std_amino_acids) - set(no_cut))
392 | return cut, no_cut
393 | 
394 | out_rules = []
395 | for protease in cleavage_rule.split(','):
396 | protease = protease.replace('X', ''.join(parser.std_amino_acids))
397 | c_term_rule, n_term_rule = protease.split('|')
398 | sense = get_sense(c_term_rule, n_term_rule)
399 | if sense == 'C':
400 | cut, no_cut = get_cut(c_term_rule, n_term_rule)
401 | else:
402 | cut, no_cut = get_cut(n_term_rule, c_term_rule)
403 | 
404 | if no_cut:
405 | if sense == 'C':
406 | out_rules.append('([%s](?=[^%s]))' % (cut, no_cut))
407 | else:
408 | out_rules.append('([^%s](?=[%s]))' % (no_cut, cut))
409 | else:
410 | if sense == 'C':
411 | out_rules.append('([%s])' % (cut, ))
412 | else:
413 | out_rules.append('(?=[%s])' % (cut, ))
414 | return '|'.join(out_rules)
415 | 
416 | 
417 | def keywithmaxval(d):
418 | """a) create a list of the dict's keys and values;
419 | b) return the key with the max value"""
420 | v = list(d.values())
421 | k = list(d.keys())
422 | return k[v.index(max(v))]
423 | 
424 | def calc_sf_all(v, n, p, prev_best_score=False):
425 | sf_values = -np.log10(binom.sf(v-1, n, p))
426 | sf_values[np.isnan(sf_values)] = 0
427 | sf_values[np.isinf(sf_values)] = (prev_best_score if prev_best_score is not False else max(sf_values[~np.isinf(sf_values)]) * 2)
428 | return sf_values
429 | 
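# calc_sf_all scores proteins with the binomial survival function:
# score = -log10(P(X >= v)) for X ~ Binom(n, p), i.e. the probability of
# matching at least v peptides by chance. A worked example with hypothetical
# numbers: v = 5 matched out of n = 20 theoretical peptides at p = 0.05
# gives binom.sf(4, 20, 0.05) ~= 2.6e-3, i.e. a score of ~2.6.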
--------------------------------------------------------------------------------
/ms1searchpy/utils_figures.py:
--------------------------------------------------------------------------------
1 | from __future__ import division
2 | from scipy.stats import scoreatpercentile
3 | from pyteomics import mass, auxiliary as aux
4 | from collections import Counter
5 | import os.path
6 | import numpy as np
7 | import matplotlib
8 | matplotlib.use('Agg')
9 | import matplotlib.pyplot as plt
10 | from matplotlib.patches import Patch
11 | try:
12 | import seaborn
13 | seaborn.set(rc={'axes.facecolor':'#ffffff'})
14 | seaborn.set_style('whitegrid')
15 | except ImportError:
16 | pass
17 | import re
18 | import sys
19 | import logging
20 | import pandas as pd
21 | redcolor = '#FC6264'
22 | bluecolor = '#70aed1'
23 | greencolor = '#8AA413'
24 | aa_color_1 = '#fca110'
25 | aa_color_2 = '#a41389'
26 | 
27 | 
28 | def _get_sf(fig):
29 | return isinstance(fig, str)
30 | 
31 | 
32 | def get_basic_distributions(df):
33 | if 'mz' in df.columns:
34 | mz_array = df['mz'].values
35 | rt_exp_array = df['rtApex'].values
36 | intensity_array = np.log10(df['intensityApex'].values)
37 | if 'FAIMS' in df.columns:
38 | faims_array = df['FAIMS'].replace('None', 0).values
39 | else:
40 | faims_array = np.zeros(len(df))
41 | mass_diff_array = []
42 | RT_diff_array = []
43 | else:
44 | mz_array = df['m/z'].values
45 | rt_exp_array = df['RT'].values
46 | intensity_array = np.log10(df['Intensity'].values)
47 | faims_array = df['ion_mobility'].values
48 | mass_diff_array = df['mass diff'].values
49 | RT_diff_array = df['RT diff'].values
50 | charges_array = df['charge'].values
51 | nScans_array = df['nScans'].values
52 | nIsotopes_array = df['nIsotopes'].values
53 | return mz_array, rt_exp_array, charges_array, intensity_array, nScans_array, nIsotopes_array, faims_array, mass_diff_array, RT_diff_array
54 | 
55 | 
56 | def get_descriptor_array(df, df_f, dname):
57 | array_t = df[~df['decoy']][dname].values
58 | array_d = df[df['decoy']][dname].values
59 | array_v = df_f[dname].values
60 | return array_t, array_d, array_v
61 | 
62 | 
63 | def plot_hist_basic(array_all, fig, subplot_max_x, subplot_i,
64 | xlabel, ylabel='# of identifications', idtype='features', bin_size_one=False):
65 | separate_figures = _get_sf(fig)
66 | if separate_figures:
67 | plt.figure()
68 | else:
69 | fig.add_subplot(subplot_max_x, 3, subplot_i)
70 | cbins = get_bins((array_all, ), bin_size_one)
71 | if idtype == 'features':
72 | plt.hist(array_all, bins=cbins, color=redcolor, alpha=0.8, edgecolor='#EEEEEE')
73 | else:
74 | plt.hist(array_all, bins=cbins, color=greencolor, alpha=0.8, edgecolor='#EEEEEE')
75 | plt.ylabel(ylabel)
76 | plt.xlabel(xlabel)
77 | if separate_figures:
78 | plt.savefig(outpath(fig, xlabel, '.png'))
79 | plt.close()
80 | 
81 | def plot_basic_figures(df, fig, subplot_max_x, subplot_start, idtype):
82 | mz_array, rt_exp_array, charges_array, intensity_array, nScans_array, nIsotopes_array, faims_array, mass_diff_array, RT_diff_array = get_basic_distributions(df)
83 | 
84 | plot_hist_basic(mz_array, fig, subplot_max_x, subplot_i=subplot_start,
85 | xlabel='%s, precursor m/z' % (idtype, ), idtype=idtype)
86 | subplot_start += 1
87 | plot_hist_basic(rt_exp_array, fig, subplot_max_x, subplot_i=subplot_start,
88 | xlabel='%s, RT experimental' % (idtype, ), idtype=idtype)
89 | subplot_start += 1
90 | plot_hist_basic(charges_array, fig, subplot_max_x, subplot_i=subplot_start,
91 | xlabel='%s, charge' % (idtype, ), idtype=idtype, bin_size_one=True)
92 | subplot_start += 1
93 | plot_hist_basic(intensity_array, fig, subplot_max_x, subplot_i=subplot_start,
94 | xlabel='%s, log10(Intensity)' % (idtype, ), idtype=idtype)
95 | subplot_start += 1
96 | plot_hist_basic(nScans_array, fig, subplot_max_x, subplot_i=subplot_start,
97 | xlabel='%s, nScans' % (idtype, ), idtype=idtype,
bin_size_one=True) 98 | subplot_start += 1 99 | plot_hist_basic(nIsotopes_array, fig, subplot_max_x, subplot_i=subplot_start, 100 | xlabel='%s, nIsotopes' % (idtype, ), idtype=idtype, bin_size_one=True) 101 | subplot_start += 1 102 | plot_hist_basic(faims_array, fig, subplot_max_x, subplot_i=subplot_start, 103 | xlabel='%s, ion mobility' % (idtype, ), idtype=idtype) 104 | if idtype=='peptides': 105 | subplot_start += 1 106 | plot_hist_basic(mass_diff_array, fig, subplot_max_x, subplot_i=subplot_start, 107 | xlabel='%s, mass diff ppm' % (idtype, ), idtype=idtype) 108 | subplot_start += 1 109 | plot_hist_basic(RT_diff_array, fig, subplot_max_x, subplot_i=subplot_start, 110 | xlabel='%s, RT diff min' % (idtype, ), idtype=idtype) 111 | 112 | 113 | def plot_protein_figures(df, df_f, fig, subplot_max_x, subplot_start, prefix): 114 | plot_hist_descriptor(get_descriptor_array(df, df_f, dname='sq'), fig, subplot_max_x, subplot_start, xlabel='proteins, sequence coverage', ylabel='# of identifications', only_true=True) 115 | subplot_start += 1 116 | plot_hist_descriptor(get_descriptor_array(df, df_f, dname='matched peptides'), fig, subplot_max_x, subplot_start, xlabel='proteins, matched peptides', ylabel='# of identifications', only_true=True, bin_size_one=True) 117 | subplot_start += 1 118 | plot_hist_descriptor(get_descriptor_array(df, df_f, dname='corrected sq'), fig, subplot_max_x, subplot_start, xlabel='proteins, corrected sequence coverage', ylabel='# of identifications', only_true=True) 119 | subplot_start += 1 120 | plot_hist_descriptor(get_descriptor_array(df, df_f, dname='corrected matched peptides'), fig, subplot_max_x, subplot_start, xlabel='proteins, corrected matched peptides', ylabel='# of identifications', only_true=True, bin_size_one=True) 121 | subplot_start += 1 122 | plot_hist_descriptor(get_descriptor_array(df, df_f, dname='score'), fig, subplot_max_x, subplot_start, xlabel='proteins, score', ylabel='# of identifications', only_true=False) 123 | subplot_start += 1 124 | plot_qvalues(df, fig, subplot_max_x, subplot_start, prefix) 125 | subplot_start += 1 126 | return subplot_start 127 | 128 | 129 | def plot_hist_descriptor(inarrays, fig, subplot_max_x, subplot_i, xlabel, ylabel='# of identifications', only_true=False, bin_size_one=False): 130 | separate_figures = _get_sf(fig) 131 | if separate_figures: 132 | plt.figure() 133 | else: 134 | fig.add_subplot(subplot_max_x, 3, subplot_i) 135 | array_t, array_d, array_v = inarrays 136 | if xlabel == 'proteins, score': 137 | logscale=True 138 | else: 139 | logscale=False 140 | if only_true: 141 | cbins, width = get_bins_for_descriptors([array_v, ], bin_size_one=bin_size_one) 142 | H3, _ = np.histogram(array_v, bins=cbins) 143 | plt.bar(cbins[:-1], H3, width, align='center',color=greencolor, alpha=1, edgecolor='#EEEEEE', log=logscale) 144 | else: 145 | cbins, width = get_bins_for_descriptors(inarrays) 146 | H1, _ = np.histogram(array_d, bins=cbins) 147 | H2, _ = np.histogram(array_t, bins=cbins) 148 | H3, _ = np.histogram(array_v, bins=cbins) 149 | plt.bar(cbins[:-1], H1, width, align='center',color=redcolor, alpha=0.4, edgecolor='#EEEEEE', log=logscale) 150 | plt.bar(cbins[:-1], H2, width, align='center',color=bluecolor, alpha=0.4, edgecolor='#EEEEEE', log=logscale) 151 | plt.bar(cbins[:-1], H3, width, align='center',color=greencolor, alpha=1, edgecolor='#EEEEEE', log=logscale) 152 | cbins = np.append(cbins[0], cbins) 153 | H1 = np.append(np.append(0, H1), 0) 154 | H2 = np.append(np.append(0, H2), 0) 155 | cbins -= width / 2 156 | 
plt.step(cbins, H2, where='post', color=bluecolor, alpha=0.8)
157 | plt.step(cbins, H1, where='post', color=redcolor, alpha=0.8)
158 | 
159 | 
160 | 
161 | plt.ylabel(ylabel)
162 | plt.xlabel(xlabel)
163 | if width == 1.0:
164 | plt.xticks(np.arange(int(cbins[0]), cbins[-1], 1))
165 | plt.gcf().canvas.draw()
166 | if separate_figures:
167 | plt.savefig(outpath(fig, xlabel, '.png'))
168 | plt.close()
169 | 
170 | def plot_qvalues(df, fig, subplot_max_x, subplot_i, prefix):
171 | separate_figures = _get_sf(fig)
172 | df1 = df.copy()
173 | if separate_figures:
174 | plt.figure()
175 | else:
176 | fig.add_subplot(subplot_max_x, 3, subplot_i)
177 | df1['shortname'] = df1['dbname'].apply(lambda x: x.replace(prefix, ''))
178 | df1 = df1.sort_values(by='score', ascending=False)
179 | df1 = df1.drop_duplicates(subset='shortname')
180 | df1 = df1[df1['score'] > 0]
181 | qar = aux.qvalues(df1, key='score', is_decoy='decoy', reverse=True, remove_decoy=True, correction=1)['q']
182 | if sum(qar<=0.01) == 0:
183 | qar = aux.qvalues(df1, key='score', is_decoy='decoy', reverse=True, remove_decoy=True, correction=0)['q']
184 | 
185 | plt.plot(qar*100, np.arange(1, len(qar)+1, 1))
186 | plt.ylabel('# identified proteins')
187 | plt.xlabel('FDR, %')
188 | plt.vlines(1, 0, len(qar), linestyle='--', color='b')
189 | plt.text(2, len(qar), '1%% FDR: %d proteins' % (sum(qar<=0.01)), )
190 | plt.vlines(5, 0, len(qar)*2/3, linestyle='--', color='b')
191 | plt.text(6, len(qar)*2/3, '5%% FDR: %d proteins' % (sum(qar<=0.05)), )
192 | plt.vlines(15, 0, len(qar)/3, linestyle='--', color='b')
193 | plt.text(15, len(qar)/3, '15%% FDR: %d proteins' % (sum(qar<=0.15)), )
194 | if separate_figures:
195 | plt.savefig(outpath(fig, 'FDR, %', '.png'))
196 | plt.close()
197 | 
198 | 
199 | def plot_legend(fig, subplot_max_x, subplot_start):
200 | ax = fig.add_subplot(subplot_max_x, 3, subplot_start)
201 | legend_elements = [Patch(facecolor=greencolor, label='Positive IDs'),
202 | Patch(facecolor=bluecolor, label='Targets'),
203 | Patch(facecolor=redcolor, label='Decoys')]
204 | ax.legend(handles=legend_elements, loc='center', prop={'size': 24})
205 | ax.set_axis_off()
206 | 
207 | 
208 | def plot_aa_stats(df_f, df, fig, subplot_max_x, subplot_i):
209 | separate_figures = _get_sf(fig)
210 | if separate_figures:
211 | plt.figure()
212 | else:
213 | fig.add_subplot(subplot_max_x, 3, subplot_i)
214 | 
215 | # Generate list of the 20 standard amino acids
216 | std_aa_list = list(mass.std_aa_mass.keys())
217 | std_aa_list.remove('O')
218 | std_aa_list.remove('U')
219 | 
220 | # Count identified amino acids
221 | aa_exp = Counter()
222 | for pep in set(df_f['sequence']):
223 | for aa in pep:
224 | aa_exp[aa] += 1
225 | 
226 | # Count amino acids in decoy peptides (database background)
227 | aa_theor = Counter()
228 | for pep in set(df[df['decoy']]['sequence']):
229 | for aa in pep:
230 | aa_theor[aa] += 1
231 | 
232 | aa_exp_sum = sum(aa_exp.values())
233 | aa_theor_sum = sum(aa_theor.values())
234 | lbls, vals = [], []
235 | for aa in sorted(std_aa_list):
236 | if aa_theor.get(aa, 0):
237 | lbls.append(aa)
238 | vals.append((aa_exp.get(aa, 0)/aa_exp_sum) / (aa_theor[aa]/aa_theor_sum))
239 | clrs = [aa_color_1 if abs(x-1) < 0.4 else aa_color_2 for x in vals]
240 | plt.bar(range(len(vals)), vals, color=clrs)
241 | plt.xticks(range(len(lbls)), lbls)
242 | plt.hlines(1.0, range(len(vals))[0]-1, range(len(vals))[-1]+1)
243 | plt.ylabel('amino acid ID rate')
244 | if separate_figures:
245 | plt.savefig(outpath(fig, 'amino acid ID rate', '.png'))
246 | plt.close()
247 | 
248 | 
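# The "amino acid ID rate" above is the frequency of each residue among
# identified peptides divided by its frequency among decoy (background)
# peptides. Worked example with hypothetical numbers: if 8% of identified
# residues are K but only 5% of background residues are, the K bar is
# 0.08 / 0.05 = 1.6 and is highlighted as an outlier, since |1.6 - 1| >= 0.4.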
249 | def calc_max_x_value(df, df_proteins):
250 | cnt = 24 # number of basic figures
251 | peptide_columns = set(df.columns)
252 | features_list = []
253 | for feature in features_list:
254 | if feature in peptide_columns:
255 | cnt += 1
256 | return cnt // 3 + (1 if (cnt % 3) else 0)
257 | 
258 | 
259 | def plot_descriptors_figures(df, df_f, fig, subplot_max_x, subplot_start):
260 | plot_hist_descriptor(get_descriptor_array(df, df_f, dname='mass diff'), fig, subplot_max_x, subplot_start, xlabel='precursor mass difference, ppm')
261 | subplot_start += 1
262 | plot_hist_descriptor(get_descriptor_array(df, df_f, dname='RT diff'), fig, subplot_max_x, subplot_start, xlabel='RT difference, min')
263 | subplot_start += 1
264 | separate_figures = _get_sf(fig)
265 | if not separate_figures:
266 | plot_legend(fig, subplot_max_x, subplot_start)
267 | subplot_start += 1
268 | 
269 | 
270 | def get_bins(inarrays, bin_size_one=False):
271 | tmp = np.concatenate(inarrays)
272 | minv = tmp.min()
273 | maxv = tmp.max()
274 | if bin_size_one:
275 | return np.arange(minv, maxv+1, 1)
276 | else:
277 | return np.linspace(minv, maxv+1, num=100)
278 | 
279 | 
280 | def get_bins_for_descriptors(inarrays, bin_size_one=False):
281 | tmp = np.concatenate(inarrays)
282 | minv = tmp.min()
283 | maxv = tmp.max()
284 | if len(set(tmp)) <= 15:
285 | return np.arange(minv, maxv + 2, 1.0), 1.0
286 | binsize = False
287 | for inar in inarrays:
288 | binsize_tmp = get_fdbinsize(inar)
289 | if not binsize or binsize > binsize_tmp:
290 | binsize = binsize_tmp
291 | # binsize = get_fdbinsize(tmp)
292 | if binsize < float(maxv - minv) / 300:
293 | binsize = float(maxv - minv) / 300
294 | 
295 | if bin_size_one:
296 | binsize = int(binsize)
297 | if binsize == 0:
298 | binsize = 1
299 | 
300 | lbin_s = scoreatpercentile(tmp, 1.0)
301 | lbin = minv
302 | if lbin_s and abs((lbin - lbin_s) / lbin_s) > 1.0:
303 | lbin = lbin_s * 1.05
304 | rbin_s = scoreatpercentile(tmp, 99.0)
305 | rbin = maxv
306 | if rbin_s and abs((rbin - rbin_s) / rbin_s) > 1.0:
307 | rbin = rbin_s * 1.05
308 | rbin += 1.5 * binsize
309 | return np.arange(lbin, rbin + binsize, binsize), binsize
310 | 
311 | 
312 | def get_fdbinsize(data_list):
313 | """Calculate the Freedman-Diaconis bin size for
314 | a data set, for use in making a histogram.
315 | Arguments:
316 | data_list: 1D data set
317 | Returns:
318 | optimal_bin_size: F-D bin size
319 | """
320 | if not isinstance(data_list, np.ndarray):
321 | data_list = np.array(data_list)
322 | isnan = np.isnan(data_list)
323 | data_list = np.sort(data_list[~isnan])
324 | upperquartile = scoreatpercentile(data_list, 75)
325 | lowerquartile = scoreatpercentile(data_list, 25)
326 | iqr = upperquartile - lowerquartile
327 | optimal_bin_size = 2. * iqr / len(data_list) ** (1. / 3.)
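# Freedman-Diaconis: bin width = 2 * IQR / n^(1/3). Worked example with
# hypothetical numbers: n = 1000 points with IQR = 5.0 gives
# 2 * 5.0 / 1000 ** (1. / 3.) = 1.0. Degenerate results are clamped below.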
328 | MINBIN = 1e-8 329 | if optimal_bin_size < MINBIN: 330 | return MINBIN 331 | return optimal_bin_size 332 | 333 | 334 | def normalize_fname(s): 335 | return re.sub(r'[<>:\|/?*]', '', s) 336 | 337 | 338 | def outpath(outfolder, s, ext='.png'): 339 | return os.path.join(outfolder, normalize_fname(s) + ext) 340 | 341 | 342 | def plot_outfigures(df, df_peptides, df_peptides_f, base_out_name, df_proteins, df_proteins_f, prefix='DECOY_', separate_figures=False): 343 | if not separate_figures: 344 | fig = plt.figure(figsize=(16, 12)) 345 | dpi = fig.get_dpi() 346 | fig.set_size_inches(3000.0/dpi, 3000.0/dpi) 347 | else: 348 | outfolder = os.path.join(base_out_name + '_figures') 349 | if not os.path.isdir(outfolder): 350 | os.makedirs(outfolder) 351 | fig = outfolder 352 | subplot_max_x = calc_max_x_value(df, df_proteins) 353 | descriptor_start_index = 20 354 | plot_basic_figures(df, fig, subplot_max_x, 1, 'features') 355 | plot_basic_figures(df_peptides_f, fig, subplot_max_x, 8, 'peptides') 356 | df_proteins['sq'] = df_proteins['matched peptides'] / df_proteins['theoretical peptides'] * 100 357 | df_proteins_f['sq'] = df_proteins_f['matched peptides'] / df_proteins_f['theoretical peptides'] * 100 358 | 359 | p_decoy = df_proteins[df_proteins['decoy']]['matched peptides'].sum() / df_proteins[df_proteins['decoy']]['theoretical peptides'].sum() 360 | df_proteins['corrected matched peptides'] = df_proteins['matched peptides'] - p_decoy * df_proteins['theoretical peptides'] 361 | df_proteins_f['corrected matched peptides'] = df_proteins_f['matched peptides'] - p_decoy * df_proteins_f['theoretical peptides'] 362 | df_proteins['corrected sq'] = df_proteins['corrected matched peptides'] / df_proteins['theoretical peptides'] * 100 363 | df_proteins_f['corrected sq'] = df_proteins_f['corrected matched peptides'] / df_proteins_f['theoretical peptides'] * 100 364 | 365 | subplot_current = plot_protein_figures(df_proteins, df_proteins_f, fig, subplot_max_x, 17, prefix) 366 | plot_aa_stats(df_peptides_f, df_peptides, fig, subplot_max_x, subplot_current) 367 | plt.grid(color='#EEEEEE') 368 | plt.tight_layout() 369 | if not separate_figures: 370 | plt.savefig(base_out_name + '.png') 371 | plt.close() 372 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pyteomics>=4.5.1 2 | lxml 3 | scipy 4 | numpy 5 | scikit-learn 6 | lightgbm 7 | pandas 8 | biosaur2 9 | matplotlib 10 | seaborn 11 | ete3 -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | ''' 4 | setup.py file for ms1searchpy 5 | ''' 6 | 7 | from setuptools import setup, find_packages 8 | 9 | version = open('VERSION').readline().strip() 10 | 11 | setup( 12 | name = 'ms1searchpy', 13 | version = version, 14 | description = '''A proteomics search engine for LC-MS1 spectra.''', 15 | long_description = (''.join(open('README.md').readlines())), 16 | long_description_content_type = 'text/markdown', 17 | author = 'Mark Ivanov', 18 | author_email = 'pyteomics@googlegroups.com', 19 | install_requires = [line.strip() for line in open('requirements.txt')], 20 | classifiers = ['Intended Audience :: Science/Research', 21 | 'Programming Language :: Python :: 3', 22 | 'Topic :: Education', 23 | 'Topic :: Scientific/Engineering :: Bio-Informatics', 24 | 'Topic :: Scientific/Engineering 
:: Chemistry',
25 | 'Topic :: Scientific/Engineering :: Physics'],
26 | license = 'License :: OSI Approved :: Apache Software License',
27 | packages = find_packages(),
28 | entry_points = {'console_scripts': ['ms1searchpy = ms1searchpy.search:run',
29 | 'ms1combine = ms1searchpy.combine:run',
30 | 'ms1groups = ms1searchpy.group_specific:run',
31 | 'ms1combine_proteins = ms1searchpy.combine_proteins:run',
32 | 'directms1quant = ms1searchpy.directms1quant:run',
33 | 'directms1quantmulti = ms1searchpy.directms1quantmulti:run',
34 | 'directms1quantDIA = ms1searchpy.directms1quantDIA:run',
35 | 'directms1quantneg = ms1searchpy.directms1quantneg:run',
36 | 'ms1quant = ms1searchpy.directms1quant:run',
37 | 'ms1todiffacto = ms1searchpy.ms1todiffacto:run',
38 | ]}
39 | )
40 | 
--------------------------------------------------------------------------------
/test.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | TESTDIR="test_data"
4 | if [ "$1" = "--no-deeplc" ]; then
5 | NODEEPLC=true
6 | else
7 | NODEEPLC=false
8 | fi
9 | 
10 | if [ ! -d "$TESTDIR" ]; then
11 | echo "You must have the test dataset to run this test."
12 | exit 1
13 | fi
14 | 
15 | cd "$TESTDIR"
16 | rm -vf *.features* *.tsv *.txt
17 | 
18 | echo ""
19 | echo "Starting ms1searchpy ..."
20 | echo "------------------------"
21 | ms1command="time ms1searchpy -d sprot_ecoli_ups.fasta -ad 1 -nproc 6 -debug "
22 | if ! $NODEEPLC; then ms1command+="-deeplc 1 -deeplc_library deeplc.lib "; fi
23 | ms1command+="*.mzML"
24 | echo "$ms1command"
25 | eval "$ms1command" && echo "DirectMS1 run successful."
26 | 
27 | echo ""
28 | echo "Starting ms1combine ..."
29 | echo "-----------------------"
30 | time ms1combine *_UPS_4_0?.features_PFMs_ML.tsv -out UPS_4 && echo "ms1combine run successful."
31 | time ms1combine *_UPS_2_0?.features_PFMs_ML.tsv -out UPS_2 && echo "ms1combine run successful."
32 | 
33 | echo ""
34 | echo "Starting ms1todiffacto ..."
35 | echo "--------------------------"
36 | time ms1todiffacto -dif diffacto -S1 *_UPS_4_0?.features_proteins.tsv -S2 *_UPS_2_0?.features_proteins.tsv \
37 | -norm median -out diffacto_output.tsv -min_samples 3 -debug && echo "ms1todiffacto run successful."
38 | 
39 | echo ""
40 | echo "Starting directms1quant ..."
41 | echo "---------------------------"
42 | time directms1quant -S1 *_UPS_4_0?.features_proteins_full.tsv -S2 *_UPS_2_0?.features_proteins_full.tsv -min_samples 3 \
43 | && echo "DirectMS1quant run successful."
--------------------------------------------------------------------------------