├── Extract.md
├── Hail.md
├── Local.md
├── README.md
├── Recover.md
├── Rfmix.md
├── Tractor-Hail.ipynb
├── images
    ├── .DS_Store
    ├── AFR.png
    ├── EUR.png
    ├── ExtractTract.png
    ├── Manhattan.png
    ├── SHAPEIT.png
    ├── TractorIcon.png
    ├── TractorModel.png
    ├── inference.png
    └── localancestry.png
└── test_data.zip


/Extract.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: Extract Risk Alleles and Local Ancestry Information
  3 | filename: Extract.md
  4 | ---
  5 | 
  6 | ![](images/TractorIcon.png)
  7 | 
  8 | Compared to the standard GWAS model, Tractor takes local ancestry into account. This may improve discovery power when LD, MAF, or effect sizes differ across ancestries, help localize top hits, and provide more accurate effect size estimates. In the [previous step](Rfmix.md), we demonstrated how to perform local ancestry inference using RFMix2. Now, we will process the RFMix2 output files to generate more interpretable intermediate files, which will be used in statistical modeling.
  9 | 
 10 | ## Prerequisites
 11 | 
 12 | #### Download Tractor Scripts
 13 | 
 14 | Tractor scripts are readily available on the Tractor GitHub repository and can be easily cloned. All scripts utilized in this tutorial are accessible in the `scripts/` directory.
 15 | ```
 16 | git clone https://github.com/Atkinson-Lab/Tractor.git
 17 | ```
 18 | Here, we showcase the fundamental functionality of these scripts. However, **we recommend referring to the [README](https://github.com/Atkinson-Lab/Tractor.git) for Tractor to explore additional functionalities beyond those outlined here.**
 19 | 
 20 | #### Verify files in `admixed_cohort` directory
 21 | 
 22 | The previous Phasing step should have generated the following files.
 23 | ```
 24 | ASW.unphased.vcf.gz[.csi]
 25 | ASW.unphased_mod1.vcf.gz[.csi] # Modified to include AC/AN tags
 26 | ASW.phased.vcf.gz[.csi]        # SHAPEIT5 phased data
 27 | ```
 28 | The `ASW.phased.vcf.gz` file, converted from standard SHAPEIT5 output, will serve as an argument for `extract_tracts.py`.
 29 | 
 30 | Additionally, after performing local ancestry inference with RFMix2, the following files should have been generated:
 31 | ```
 32 | ASW.deconvoluted.fb.tsv
 33 | ASW.deconvoluted.msp.tsv
 34 | ASW.deconvoluted.rfmix.Q
 35 | ASW.deconvoluted.sis.tsv
 36 | ```
 37 | 
 38 | The `ASW.deconvoluted.msp.tsv` file, containing the most probable ancestral assignments for all variants in each individual within the cohort, will be utilized as an argument for `extract_tracts.py`.
 39 | 
 40 | &nbsp;  
 41 | 
 42 | ## Extract Tracts
 43 | 
 44 | We offer a script capable of extracting risk allele information and local ancestry simultaneously.
 45 | ```
 46 | python3 extract_tracts.py \
 47 | --vcf admixed_cohort/ASW.phased.vcf.gz \
 48 | --msp admixed_cohort/ASW.deconvoluted.msp.tsv \
 49 | --num-ancs 2
 50 | ```
 51 | 
 52 | 4 files will be generated:
 53 | ```less
 54 | ASW.phased.anc0.dosage.txt
 55 | ASW.phased.anc0.hapcount.txt
 56 | ASW.phased.anc1.dosage.txt
 57 | ASW.phased.anc1.hapcount.txt
 58 | ```
 59 | 
 60 | #### What happened under the hood?
 61 | 
 62 | Let's take a peek at our input files `ASW.phased.vcf.gz` and `ASW.deconvoluted.msp.tsv`.
 63 | 
 64 | The phased vcf file contains haplotype information for each individual in the cohort. Reference alleles are represented with a 0, alternate alleles (often assumed to be the risk allele) as a 1. Here you can see the first 4 variants in chromosome 22 for SAMPLE1:
 65 | ```
 66 | CHROM POS REF ALT SAMPLE1   ...
 67 | 22    10  A   T   1|0       ...
 68 | 22    93  G   A   0|0       ...
 69 | 22    106 G   A   0|0       ...
 70 | 22    223 G   A   0|1       ...
 71 | ...
 72 | ```
 73 | 
 74 | The msp file contains local ancestry information for each person, for each of their two chromosome strands. Notice that its coordinates are in a different format from the vcf file: the msp file uses windows to specify LA information. For example, the first row of the following msp file tells us that, for SAMPLE1, strand 0 from position 1 to position 52 is inherited from `EUR` ancestry (anc1 in our input files) and strand 1 from position 1 to position 52 is inherited from an `AFR` background (anc0 in our input files). Generally, RFMix2 orders ancestries alphabetically, so the first will be anc0, the second anc1, etc.
 75 | 
 76 | ```
 77 | AFR=0       EUR=1
 78 | CHROM SPOS  EPOS SAMPLE1.0 SAMPLE1.1   ...
 79 | 22    1     52  1         0       ...
 80 | 22    52    101 1         1       ...
 81 | 22    101   190 0         1       ...
 82 | 22    190   283 0         1       ...
 83 | ...
 84 | ```
 85 | 
 86 | By checking both files simultaneously, we can figure out the local ancestry (LA) background of each of the alleles in our dataset, as the upper portion of the following diagram shows. To represent both LA and risk allele information, we need to recode each variant into 4 columns, with each column representing a unique combination of LA and risk allele (`AFR-nonRisk`, `AFR-Risk`, `EUR-nonRisk`, `EUR-Risk`). In the diagram below, at the first variant, SAMPLE1 has one copy of `AFR-nonRisk` and one copy of `EUR-Risk`, and is therefore encoded as **[1,0,0,1]**.
 87 | 
 88 | To further compress the information, `extract_tracts.py` performs an additional step: tallying the total number of copies of `AFR` local ancestry for each variant for each person. This can be thought of as adding `AFR-nonRisk` and `AFR-Risk`.
 89 | 
 90 | ![](images/ExtractTract.png)
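
The recoding described above can be sketched in a few lines of Python. This is a toy illustration for a single sample, not the actual `extract_tracts.py` code: the helper names `ancestry_at` and `recode` are hypothetical, and treating each msp window as half-open (`SPOS <= POS < EPOS`) is an assumption made here so that a boundary position falls into exactly one window.

```python
# Toy sketch of the recoding idea for ONE sample (hypothetical helpers,
# not the real extract_tracts.py implementation).

def ancestry_at(pos, windows):
    """Return (ancestry of strand 0, ancestry of strand 1) for the window holding pos.

    Windows are treated as half-open [spos, epos); this is an assumption.
    """
    for spos, epos, anc_strand0, anc_strand1 in windows:
        if spos <= pos < epos:
            return anc_strand0, anc_strand1
    raise ValueError(f"position {pos} is not covered by any window")


def recode(variants, windows, num_ancs=2):
    """Return a list of (pos, hapcount, dosage); both lists are indexed by ancestry."""
    rows = []
    for pos, gt in variants:
        alleles = [int(a) for a in gt.split("|")]  # phased genotype, e.g. "1|0"
        ancestries = ancestry_at(pos, windows)
        hapcount = [0] * num_ancs  # number of haplotypes per ancestry
        dosage = [0] * num_ancs    # number of ALT alleles per ancestry
        for allele, anc in zip(alleles, ancestries):
            hapcount[anc] += 1
            dosage[anc] += allele
        rows.append((pos, hapcount, dosage))
    return rows


# SAMPLE1 from the vcf and msp excerpts above (AFR=0, EUR=1).
variants = [(10, "1|0"), (93, "0|0"), (106, "0|0"), (223, "0|1")]
windows = [(1, 52, 1, 0), (52, 101, 1, 1), (101, 190, 0, 1), (190, 283, 0, 1)]

rows = recode(variants, windows)
for pos, hapcount, dosage in rows:
    print(pos, "hapcount:", hapcount, "dosage:", dosage)
```

For the first variant this gives a hapcount of `[1, 1]` and a dosage of `[0, 1]`: one haplotype on each background, with the single alternate allele sitting on the `EUR` (anc1) strand.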
 91 | 
 92 | Running `extract_tracts.py` on our toy dataset will generate 4 files, 3 of which will be used in the next step for statistical modeling in the Tractor GWAS. You can read about the other files on the Tractor Wiki for this script [here](https://github.com/Atkinson-Lab/Tractor/wiki/Step-2:-Extracting-tracts-and-ancestral-dosages).
 93 | ```
 94 | ASW.phased.anc0.hapcount.txt        (used as X1)
 95 | ASW.phased.anc0.dosage.txt          (used as X2)
 96 | ASW.phased.anc1.dosage.txt          (used as X3)
 97 | ```
 98 | 
 99 | Now we are ready to run Tractor GWAS! We have recently developed scripts for Tractor that may be used either on a local computer or in a high-performance computing setting (Local Tractor). These are subtly refined versions of the originally released cloud-native scripts (we fixed a few edge cases and bad behaviors). Should Tractor need to be run on an extremely large dataset, however, we have also developed scalable cloud-native scripts written in Hail. You may select which of the two versions of Tractor GWAS to run with the links below. For this tutorial, we recommend using Local Tractor.
100 | 
101 | ## [Main Page](README.md)
102 | 
103 | ## [Next Page (Option1: Hail Tractor)](Hail.md) 
104 | 
105 | ## [Next Page (Option2: Local Tractor)](Local.md)  
106 | 


--------------------------------------------------------------------------------
/Hail.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: Tractor with Hail
 3 | filename: Hail.md
 4 | ---
 5 | 
 6 | Running Tractor GWAS in a scalable cloud native Hail format
 7 | =====
 8 | 
 9 | ![](images/TractorIcon.png)
10 | 
11 | 
12 | &nbsp;  
13 | &nbsp;  
14 | 
15 | We have prepared a Jupyter Notebook that you can follow along with. Please go to [this page](https://github.com/Atkinson-Lab/Tractor-tutorial/blob/main/Tractor-Hail.ipynb) for the Tractor-Hail implementation.
16 | 
17 | 
18 | 
19 | &nbsp;  
20 | &nbsp;  
21 | 
22 | 
23 | ## [Main Page](README.md)
24 | 


--------------------------------------------------------------------------------
/Local.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: Tractor with Local python file
  3 | filename: Local.md
  4 | ---
  5 | 
  6 | ![](images/TractorIcon.png)
  7 | 
  8 | ## Intuition
  9 | 
 10 | Tractor implicitly assumes that risk alleles residing on different local ancestry backgrounds may have different marginal effect sizes (e.g. due to LD, MAF, phenotyping differences, or true effect size differences). Accounting for local ancestry may therefore provide better biological insights, allow for less-biased effect size estimates, and potentially increase gene discovery power. In this section, we will first compare Tractor with standard GWAS analysis and admixture mapping, and we will then go over the `run_tractor.R` script to run the Tractor method locally or in an HPC setting.
 11 | 
 12 | ## Compare GWAS, Admixture mapping and Tractor 
 13 | 
 14 | In standard GWAS, we use individuals' genotype information but don't take into account local ancestry information. In the following diagram, if we sum up `AFR-Risk` and `EUR-Risk`, and perform a linear/logistic regression, we will recover a standard GWAS input.
 15 | 
 16 | In admixture mapping, we assume local ancestries affect the phenotype but don't take into account the allelic information. In this setting, we look at the number of copies of a specific ancestry background (`AFR` in this example). Summing up `AFR-Risk` and `AFR-nonRisk` and performing a linear/logistic regression, we will recover admixture mapping.
 17 | 
 18 | Tractor takes both local ancestry and risk alleles into account and can be viewed as a combination of GWAS and admixture mapping. In the Tractor model, we have 3 explanatory variables: X1 represents the number of `AFR` local ancestry haplotypes at a variant (identical to the X term used in admixture mapping); X2 represents the number of risk alleles on an AFR background (`AFR-Risk`); X3 represents the number of risk alleles on an EUR background (`EUR-Risk`). In this way, we can get ancestry-specific summary statistics as output. In other words, we can observe whether risk alleles on an `AFR` background affect the phenotype differently from risk alleles on an `EUR` background. Such differences in marginal effect sizes at GWAS SNPs can arise from different tagging of causal variants depending on the ancestry: LD blocks, polymorphic loci, and allele frequencies differ across ancestries due to different populations' demographic histories.
 19 | 
 20 | ![Methods comparison](images/TractorModel.png)
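
As a rough sketch of how the three approaches relate, the snippet below derives each model's explanatory variables from the same four per-individual counts. The function name `predictors` and the dictionary keys are hypothetical and used only for illustration; the regression itself is left out.

```python
# Illustrative only: build the explanatory variables of each method from the
# four counts [AFR-nonRisk, AFR-Risk, EUR-nonRisk, EUR-Risk] of one individual
# at one variant. `predictors` is a hypothetical helper, not part of Tractor.

def predictors(counts):
    afr_nonrisk, afr_risk, eur_nonrisk, eur_risk = counts
    return {
        # Standard GWAS: risk allele count, ignoring local ancestry.
        "gwas": afr_risk + eur_risk,
        # Admixture mapping: AFR haplotype count, ignoring alleles.
        "admixture_mapping": afr_nonrisk + afr_risk,
        # Tractor: (X1, X2, X3) as described above.
        "tractor": (afr_nonrisk + afr_risk, afr_risk, eur_risk),
    }


# SAMPLE1 at the first variant of the Extract Tracts example: [1, 0, 0, 1].
example = predictors([1, 0, 0, 1])
print(example)
```

For the encoding `[1, 0, 0, 1]`, standard GWAS sees one risk allele, admixture mapping sees one `AFR` haplotype, and Tractor sees `(X1, X2, X3) = (1, 0, 1)`, which keeps track of the fact that the risk allele sits on the `EUR` background.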
 21 | 
 22 | ## Run Tractor locally
 23 | 
 24 | We have developed an R script, `run_tractor.R`, that allows users to run Tractor GWAS on a local machine or a high-performance computing cluster. Please make sure the following libraries are installed within R:
 25 | * optparse
 26 | * data.table
 27 | * R.utils
 28 | * dplyr
 29 | * doParallel
 30 | 
 31 | Alternatively, you can create a conda environment as described on [Tractor README](https://github.com/Atkinson-Lab/Tractor?tab=readme-ov-file#setup-conda-environment). This should automatically take care of all the necessary dependencies.
 32 | 
 33 | For the phenotype file, we expect users to prepare a tab-delimited file which contains sample IDs, a phenotype column, and covariate columns. A header row is required, and the names of these columns can be passed as arguments to `run_tractor.R`. Please refer to its Usage or [Documentation here](https://github.com/Atkinson-Lab/Tractor?tab=readme-ov-file#step-2-running-tractor).
 34 | 
 35 | This is one example of how a phenotype file can be defined:
 36 | ```
 37 | IID	y	COV	...
 38 | SAMPLE1	1	0.32933076689022	...
 39 | SAMPLE2	1	1.42015135263198	...
 40 | SAMPLE3	0	1.57285800531035	...
 41 | SAMPLE4	0	-0.303899298288934	...
 42 | SAMPLE5	1	0.491130537379353	...
 43 | ...	...	...	...
 44 | ```
 45 | 
 46 | * Please make sure all columns (except the first column) of `Phe.txt` are encoded as numeric types. We currently do not automatically convert strings to numbers.
 47 | * Please include a header in `Phe.txt`.
 48 | 
 49 | 
 50 | To perform linear regression on a (simulated) continuous phenotype in our admixed cohort, simply type the following command in your terminal:
 51 | 
 52 | ```
 53 | run_tractor.R \
 54 | --hapdose admixed_cohort/ASW.phased \
 55 | --phenofile phenotype/Phe.txt \
 56 | --covarcollist none \
 57 | --method linear \
 58 | --output ASW_sumstats.txt
 59 | 
 60 | # Optional Arguments (not needed here)
 61 | ## --sampleidcol
 62 | ## --phenocol
 63 | ## --chunksize (no. of rows to read at once, Default: 10000)
 64 | ## --nthreads  (no. of threads to use)
 65 | ```
 66 | 
 67 | This generates our summary statistics output, which contains the estimated ancestry-specific p-values and effect sizes for each locus. More specifically, you should see the following columns in the summary statistics:
 68 | ```
 69 | CHR             Chromosome 
 70 | POS             Position 
 71 | ID              SNP ID
 72 | REF             Reference allele
 73 | ALT             Alternate allele
 74 | N               Total sample size
 75 | 
 76 | AF_anc0         Allele frequency for anc0; sum(dosage)/sum(local ancestry)
 77 | LAprop_anc0     Local ancestry proportion for anc0; sum(local ancestry)/(2 * sample size)
 78 | beta_anc0       Effect size for alternate alleles inherited from anc0
 79 | se_anc0         Standard error for effect size (beta_anc0)
 80 | pval_anc0       p-value for alternate alleles inherited from anc0 (NOT -log10(pvalues))
 81 | tval_anc0       t-value for anc0
 82 | 
 83 | AF_anc1         Allele frequency for anc1; sum(dosage)/sum(local ancestry)
 84 | LAprop_anc1     Local ancestry proportion for anc1; sum(local ancestry)/(2 * sample size)
 85 | beta_anc1       Effect size for alternate alleles inherited from anc1
 86 | se_anc1         Standard error for effect size (beta_anc1)
 87 | pval_anc1       p-value for alternate alleles inherited from anc1 (NOT -log10(pvalues))
 88 | tval_anc1       t-value for anc1
 89 | 
 90 | LApval_anc0     p-value for the local ancestry term (X1 term in Tractor)
 91 | LAeff_anc0      Effect size for the local ancestry term (X1 term in Tractor)
 92 | ```
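
To make the `AF_anc0` and `LAprop_anc0` definitions concrete, here is a small Python sketch for a single variant. It reads the `LAprop` definition as a division by twice the sample size, which keeps the proportion between 0 and 1; that reading, the helper name `ancestry_summaries`, and the made-up input values are assumptions for illustration, not code from `run_tractor.R`.

```python
# Illustrative only: per-variant allele frequency and local ancestry proportion
# for one ancestry. `ancestry_summaries` is a hypothetical helper.

def ancestry_summaries(dosages, hapcounts):
    """dosages/hapcounts: one value per individual, for a single ancestry.

    AF     = sum(dosage) / sum(local ancestry)
    LAprop = sum(local ancestry) / (2 * sample size)   (assumed reading)
    """
    n = len(hapcounts)
    total_haps = sum(hapcounts)
    af = sum(dosages) / total_haps if total_haps else float("nan")
    la_prop = total_haps / (2 * n)
    return af, la_prop


# Made-up values for four individuals at one variant.
dosages = [0, 1, 2, 0]    # ALT alleles carried on the anc0 background
hapcounts = [1, 1, 2, 0]  # haplotypes assigned to anc0

af, la_prop = ancestry_summaries(dosages, hapcounts)
print(af, la_prop)
```

With these toy inputs, 3 of the 4 anc0 haplotypes carry the alternate allele (AF of 0.75), and 4 of the 8 haplotypes in the sample are anc0 (LAprop of 0.5).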
 93 | 
 94 | &nbsp;  
 95 | &nbsp;  
 96 | To visualize Tractor GWAS results, we may use the R package [qqman](https://cran.r-project.org/web/packages/qqman/vignettes/qqman.html) to draw a Manhattan plot of our sumstats, and indeed we see a top hit in the AFR plot along with nearby variants in LD with it. Note in particular that the hit appears only on the AFR ancestry background in this example; no signal is observed in the EUR plot. We recommend plotting QQ plots with the lambda GC values shown in addition to your Manhattan plots to ensure your GWAS was well controlled.
 97 | 
 98 | ```
 99 | library(qqman)
100 | sumstats = read.csv("ASW_sumstats.txt", sep = "\t")  # the --output file written by run_tractor.R above
101 | 
102 | # Assuming anc0 is AFR and anc1 is EUR. Please check LAI output to identify ancestry classifications.
103 | par(mfrow=c(1,2))
104 | manhattan(sumstats[!is.na(sumstats$pval_anc0),], chr="CHR", bp="POS", snp="ID", p="pval_anc0",
105 |           xlim = c(min(sumstats$POS), max(sumstats$POS)), ylim = c(0,15), main = "AFR")
106 | manhattan(sumstats[!is.na(sumstats$pval_anc1),], chr="CHR", bp="POS", snp="ID", p="pval_anc1",
107 |           xlim = c(min(sumstats$POS), max(sumstats$POS)), ylim = c(0,15), main = "EUR")
108 | ```
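
The lambda GC value mentioned above can be computed directly from a column of p-values. The sketch below uses only the Python standard library and the usual definition for 1-degree-of-freedom tests (median observed chi-square statistic divided by the median of the null distribution, roughly 0.455). The function name `lambda_gc` is hypothetical, and the evenly spaced p-values are a synthetic stand-in for a well-calibrated test, not real results.

```python
# Illustrative only: genomic inflation factor (lambda GC) from p-values,
# assuming each p-value comes from a 1-degree-of-freedom test.
from statistics import NormalDist, median


def lambda_gc(pvalues):
    nd = NormalDist()
    # A two-sided p-value p corresponds to chi-square = z^2 with z = Phi^-1(p / 2).
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in pvalues if 0 < p <= 1]
    # Median of a chi-square(1) distribution is Phi^-1(0.75)^2.
    null_median = nd.inv_cdf(0.75) ** 2
    return median(chi2) / null_median


# Synthetic, evenly spaced p-values mimic a well-calibrated test.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lam = lambda_gc(uniform_p)
print(round(lam, 3))
```

In practice this would be applied separately to `pval_anc0` and `pval_anc1` after dropping missing values; values close to 1 suggest the test statistics are not inflated.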
109 | 
110 | ![](images/Manhattan.png)
111 | 
112 | 
113 | ## [Main Page](README.md)
114 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ![](images/TractorIcon.png)
 2 | 
 3 | ## Tractor - A GWAS tool for admixed populations
 4 | 
 5 | Welcome to Tractor!
 6 | 
 7 | For a comprehensive understanding of Tractor's methodology and application, we invite you to read our **2021 Nature Genetics publication** [here](https://www.nature.com/articles/s41588-020-00766-y).
 8 | 
 9 | > Atkinson, E.G., Maihofer, A.X., Kanai, M. et al. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat Genet 53, 195–204 (2021).
10 | 
11 | To help users implement the method, we've curated a series of tutorials available here. These tutorials cover each step of the Tractor pipeline using a toy dataset. Please [download](https://github.com/Atkinson-Lab/Tractor-tutorial/blob/main/tutorial-data.zip) and unzip the provided file with the toy data to follow along.
12 | 
13 | Please note that while this tutorial offers a high-level guide to running Tractor, users are advised to ensure the reliability and accuracy of phasing and local ancestry inference, as these factors influence the Tractor GWAS results.
14 | 
15 | 
16 | ## Tractor pipeline step-by-step tutorials:
17 | 
18 | - **Step 0**: [Phasing and Local Ancestry Inference](Rfmix.md)
19 | - **Step 1**: [Recovering Tracts](Recover.md)
20 | - **Step 2**: [Extracting tracts and ancestry dosages](Extract.md)
21 | - **Step 3a**: [Tractor GWAS with local implementation (suitable for smaller dataset)](Local.md)
22 | - **Step 3b**: [Tractor GWAS with Hail (suitable for large scale analysis)](Hail.md)
23 | 


--------------------------------------------------------------------------------
/Recover.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: Recover Tracts (optional) (unfinished)
 3 | filename: Recover.md
 4 | ---
 5 | 
 6 | ![](images/TractorIcon.png)
 7 | 
 8 | **Note - this optional step currently is only implemented for two-way admixed populations. If the end goal of analyses does not require long-range haplotypes, this step may be skipped for faster compute time**
 9 | 
10 | This page is devoted to describing our scripts to detect and correct switch errors in phased data using local ancestry to recover long-range tracts. See Figure 1 in our manuscript for additional context. A switch error is defined as a swap in ancestries within a 1 cM window to opposite strands conditioned on heterozygous ancestral dosage. Currently, the tract recovery step is only implemented for a 2-way admixed setting. All subsequent steps are compatible with multi-way admixed cohorts. Tract recovery is not required for analyses that do not consider haplotypes, including GWAS.
11 | 
12 | &nbsp;  
13 | &nbsp;  
14 | 
15 | ## Identifying and correcting switch errors in local ancestry calls
16 | 
17 | The first (optional) step is to detect strand flips in the data by utilizing local ancestry information. This step uses RFmix ancestry calls as input and is implemented with the script UnkinkMSPfile.py, which tracks switch locations and corrects them. Output consists of 2 files: a text file documenting switch locations for each individual (input msp file name suffixed with "switches") and a corrected local ancestry file (input msp file name suffixed with "unkinked"), as in unkinking a garden hose. This first step recovers long range tracts which are disrupted by statistical phasing using the local ancestry calls and tracks the location of the identified strand switches.
18 | 
19 | Example usage: `python UnkinkMSPfile.py --msp FILENAME_STEM`
20 | 
21 | &nbsp;  
22 | &nbsp;  
23 | 
24 | ## Correcting switch errors in genotype data
25 | 
26 | 
27 | Next we also need to correct switch errors in the phased genotype file (VCF format). This recovers fuller haplotypes and improves the long-range tract distribution. This step is implemented with the script UnkinkGenofile.py and expects as input the phased VCF file that was fed into RFmix and the 'switches' file generated with the previous step. This switches file is used to determine the positions that need to be flipped in the VCF file.
28 | 
29 | Example usage: `python UnkinkGenofile.py --switches SWITCHES_FILE --genofile INPUT_VCF`
30 | 
31 | Notes: Stem filenames will be appended in each step. The first step expects the standard .msp.tsv extension from RFMix ancestry calls as input. The second step expects a VCF file with the .vcf extension.
32 | 
33 | Tractor expects all VCF files to be phased (i.e. genotypes contain pipes rather than slashes), and it is recommended to strip their INFO and FORMAT annotations prior to running to ensure good parsing. This can be accomplished with bcftools on bgzipped and tabix indexed vcfs:
34 | 
35 | 
36 | ```
37 | bgzip file.vcf
38 | tabix -p vcf file.vcf.gz
39 | bcftools annotate -x INFO,FORMAT file.vcf.gz > stripped_file.vcf
40 | ```
41 | 
42 | 
43 | 
44 | ## To Do List
45 | 
46 | This page is directly copied from Tractor wiki page. Further editing is required.
47 | 
48 | ## [Main Page](README.md) 
49 | 
50 | ## [Next Page (Extract Tracts)](Extract.md)      
51 | 
52 | 
53 | 
54 | 


--------------------------------------------------------------------------------
/Rfmix.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: Local Ancestry Deconvolution
  3 | filename: Rfmix.md
  4 | ---
  5 | 
  6 | ![](images/TractorIcon.png) 
  7 | 
  8 | # Step 0: Phasing and Local Ancestry inference
  9 | 
 10 | Before running the Tractor GWAS method, data will need to be phased and have their local ancestry inferred. To assist users with these precursor steps, we provide example code for running the full pipeline including these preliminary steps. In this tutorial, we expect our input data to be unphased but already QC'd. This may not be the case for your dataset. In addition, the flags used in these steps are simply presented to highlight one working example; you might require a different set of flags for your run depending on the dataset. We recommend reviewing the documentation of the respective tools and making sure Phasing and Local Ancestry Inference (LAI) perform well.
 11 | 
 12 | ### Data
 13 | Download and unzip this [example dataset](https://github.com/Atkinson-Lab/Tractor-tutorial/blob/main/tutorial-data.zip) to follow along with the tutorial.
 14 | 
 15 | ```
 16 | curl -O -L https://github.com/Atkinson-Lab/Tractor-tutorial/raw/main/tutorial-data.zip
 17 | unzip tutorial-data.zip
 18 | ```
 19 | 
 20 | The example cohort dataset we are going to use here consists of chromosome 22 for 61 African American individuals from the [1000 Genomes Project](https://www.internationalgenome.org/). These individuals are two-way admixed with components from continental African (AFR) and European (EUR) ancestries. We simulated phenotypes for these individuals for use in the GWAS.
 21 | 
 22 | We've acquired a haplotype reference panel from the [SHAPEIT website](https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated_SHAPEIT2_16-06-14.html) solely for demonstration purposes. However, we utilize more up-to-date references for our own analyses.
 23 | 
 24 | References play a crucial role in Phasing and Local Ancestry Inference (LAI), so we strongly recommend ensuring that you're using the most suitable references for your analysis. For guidance on selecting appropriate references, please consult the software websites for detailed instructions.
 25 | 
 26 | Here is a complete list of the files:
 27 | ```
 28 | tutorial-data
 29 | ├── admixed_cohort
 30 | │   ├── ASW.unphased.vcf.gz
 31 | │   └── ASW.unphased.vcf.gz.csi
 32 | ├── phenotype
 33 | │   ├── Phe.txt
 34 | │   └── Phe2.txt
 35 | └── references
 36 |     ├── chr22.b37.gmap.gz
 37 |     ├── chr22.genetic_map.modified.txt
 38 |     ├── TGP_HGDP_QC_hg19_chr22.vcf.gz
 39 |     ├── TGP_HGDP_QC_hg19_chr22.vcf.gz.csi
 40 |     └── YRI_GBR_samplemap.txt
 41 | ```
 42 | 
 43 | ### Software used
 44 | 
 45 | * Phasing using [SHAPEIT5](https://odelaneau.github.io/shapeit5/)
 46 | * Local Ancestry Inference using [RFMix2](https://github.com/slowkoni/rfmix)
 47 | * Ancestry-Specific GWAS using [Tractor](https://github.com/Atkinson-Lab/Tractor)
 48 | 
 49 | ## Statistical Phasing using SHAPEIT5
 50 | 
 51 | Cohort genotype data is commonly provided in [VCF format](https://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/). However, despite the human genome being diploid, sequencing/genotyping technology can only reveal genotype presence, not their orientation (haplotype). This means we lack information on which allele resides on which chromosome strand. Statistical phasing aims to reconstruct allele configurations along a chromosome, as depicted in the diagram.
 52 | 
 53 | ![](images/SHAPEIT.png)
 54 | 
 55 | Notice that the two alleles in each genotype entry are separated by a slash (e.g. `0/1`), which means the VCF file is unphased. By performing phasing, we will figure out the most likely configuration of allele positions. After performing the following steps, we should get a phased VCF file, with the slash replaced by a vertical bar (e.g. `1|0`), which indicates that the data is phased.
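
As a quick illustration of the slash versus vertical bar convention, the following Python sketch classifies a few genotype strings. The helper `is_phased` is hypothetical and looks only at the separator in a diploid `GT` value; it is not a replacement for inspecting a real VCF with a dedicated tool.

```python
# Illustrative only: tell phased from unphased diploid GT strings by their separator.

def is_phased(gt):
    """True when the genotype uses '|' (phased) and no '/' (unphased)."""
    return "|" in gt and "/" not in gt


genotypes = ["0/1", "1|0", "0|0"]
flags = [is_phased(gt) for gt in genotypes]
print(dict(zip(genotypes, flags)))
```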
 56 | 
 57 | Several tools have been developed for statistical phasing. Here we use [SHAPEIT5](https://odelaneau.github.io/shapeit5/) to perform phasing using a reference haplotype panel.
 58 | 
 59 | As previously mentioned, the following steps and flags are shown for demonstration purposes only and are not a comprehensive guide to SHAPEIT5. Phasing can also be performed with other tools.
 60 | 
 61 | Please refer to the SHAPEIT5 manual [here](https://odelaneau.github.io/shapeit5/docs/documentation/phase_common/) to better understand the functionality of SHAPEIT5.
 62 | 
 63 | #### A. Adding INFO/AC and INFO/AN tags required by SHAPEIT5 pre-phasing
 64 | 
 65 | ```
 66 | # Add AC/AN tags in input file
 67 | bcftools +fill-tags admixed_cohort/ASW.unphased.vcf.gz \
 68 | -Oz \
 69 | -o admixed_cohort/ASW.unphased_mod1.vcf.gz \
 70 | -- -t AN,AC
 71 | 
 72 | # Index file
 73 | bcftools index admixed_cohort/ASW.unphased_mod1.vcf.gz
 74 | ```
 75 | 
 76 | #### B. Perform SHAPEIT5 phasing
 77 | 
 78 | This step aims to phase the common variants in the input file using the reference panel.
 79 | The required static binary (`phase_common_static`) for SHAPEIT5 can be downloaded [here](https://github.com/odelaneau/shapeit5/releases/tag/v5.1.1).
 80 | 
 81 | The `--thread` argument is optional and should be adjusted based on the number of cores available on your system. This is a computationally expensive step, especially in cases of real datasets, and thus multi-threading is recommended.
 82 | 
 83 | ```       
 84 | phase_common_static \
 85 | --input admixed_cohort/ASW.unphased_mod1.vcf.gz \
 86 | --reference references/TGP_HGDP_QC_hg19_chr22.vcf.gz \
 87 | --region 22 \
 88 | --map references/chr22.b37.gmap.gz \
 89 | --output admixed_cohort/ASW.phased.bcf \
 90 | --thread 8
 91 | 
 92 | # Convert output BCF to VCF and index it
 93 | bcftools convert \
 94 | -Oz \
 95 | -o admixed_cohort/ASW.phased.vcf.gz \
 96 | admixed_cohort/ASW.phased.bcf
 97 | 
 98 | bcftools index \
 99 | admixed_cohort/ASW.phased.vcf.gz
100 | 
101 | # Delete redundant file
102 | rm admixed_cohort/ASW.phased.bcf admixed_cohort/ASW.phased.bcf.csi
103 | ```
104 | 
105 | The program will display critical information regarding input and output files, including the number of variants present, variants removed due to lack of overlap between the main input and reference panel, and the number of identified samples.
106 | 
107 | ```less      
108 | Reading genotype data:
109 |   * VCF/BCF scanning done (13.13s)
110 |       + Variants [#sites=182525 / region=22]
111 |          - 34621 sites removed in main panel [not in reference panel]
112 |          - 11249 sites removed in reference panel [not in main panel]
113 |       + Samples [#target=61 / #reference=1572]
114 | ```
115 | 
116 | Note: If using a reference panel, ensure that your panel is phased. If you intend to use an unphased reference panel, consider first running SHAPEIT5 phasing on the reference panel or joint-phasing your cohort and reference populations in a combined file.
117 | 
118 | SHAPEIT5 users have the option to phase the dataset without a reference panel. The necessity of a reference panel depends on your input cohort and the potential value it may add. Additional information can be found on the [SHAPEIT5 website](https://odelaneau.github.io/shapeit5/).
119 | 
120 | &nbsp;
121 | 
122 | ## Local Ancestry Inference
123 | 
124 | To conduct local ancestry inference, a homogeneous phased reference panel representing relevant ancestries and a phased admixed cohort VCF file are required. Admixed populations exhibit chromosomes composed of mosaic ancestral tracts, as different chromosomal segments are inherited from multiple ancestries. The length of these tracts is influenced by the demographic history of the population, with shorter tracts resulting from more recent admixture events, as recombination breaks down tracts over generations.
125 | 
126 | For instance, in a two-way admixed AFR-EUR population, illustrated below, the first generation inherits one full chromosome copy from each ancestry. Subsequent generations undergo recombination during meiosis, resulting in chromosomes fragmented into smaller ancestral segments.
127 | 
128 | ![](images/localancestry.png)
129 | 
130 | The objective of local ancestry inference is to assign ancestry origins to each genomic segment within individuals' chromosomes. In the subsequent analysis of the Tractor pipeline, local ancestry information facilitates a deeper understanding of genotype-phenotype associations.
131 | 
132 | ![](images/inference.png)
133 | 
134 | 
135 | For local ancestry inference, we utilize [RFMix2](https://github.com/slowkoni/rfmix/blob/master/MANUAL.md), known for its accuracy in two-way AFR/EUR admixed populations. RFMix2 requires the following parameters and generates four files (`ASW.deconvoluted.fb.tsv`, `ASW.deconvoluted.msp.tsv`, `ASW.deconvoluted.rfmix.Q`, `ASW.deconvoluted.sis.tsv`), most of which are utilized in downstream analysis. It is advisable to validate the performance of local ancestry inference by reviewing the global ancestry fractions provided in `ASW.deconvoluted.rfmix.Q` to ensure alignment with expected global ancestry proportions in your target cohort.
136 | 
137 | ```
138 | rfmix \
139 | -f admixed_cohort/ASW.phased.vcf.gz \
140 | -r references/TGP_HGDP_QC_hg19_chr22.vcf.gz \
141 | -m references/YRI_GBR_samplemap.txt \
142 | -g references/chr22.genetic_map.modified.txt \
143 | -o admixed_cohort/ASW.deconvoluted \
144 | --chromosome=22
145 | ```
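
One way to carry out the global ancestry check suggested above is to average the per-sample fractions in the `.rfmix.Q` file. The Python sketch below assumes the file has comment lines starting with `#`, one of which is a `#sample` header naming the ancestries, followed by one whitespace-separated row per sample; please confirm this layout against your own output. The function name `mean_global_ancestry` and the two toy rows are made up for illustration.

```python
# Illustrative only: mean global ancestry fractions from a .rfmix.Q-style table.
# The assumed layout ('#sample' header line, one row per sample) should be
# checked against your own RFMix2 output.
import io


def mean_global_ancestry(handle):
    names, totals, n_samples = None, None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            if fields and fields[0] == "sample":
                names = fields[1:]
                totals = [0.0] * len(names)
            continue
        if totals is None:
            raise ValueError("no '#sample' header line found before the data rows")
        values = line.split()[1:]  # skip the sample ID column
        for i, value in enumerate(values):
            totals[i] += float(value)
        n_samples += 1
    return {name: total / n_samples for name, total in zip(names, totals)}


# Made-up two-sample table standing in for ASW.deconvoluted.rfmix.Q.
toy_q = (
    "#rfmix diploid global ancestry .Q format output\n"
    "#sample\tAFR\tEUR\n"
    "SAMPLE1\t0.80\t0.20\n"
    "SAMPLE2\t0.70\t0.30\n"
)
means = mean_global_ancestry(io.StringIO(toy_q))
print(means)
```

To run it on real output, pass an open file handle, for example `open("admixed_cohort/ASW.deconvoluted.rfmix.Q")`, and compare the averages with the ancestry proportions you expect for your cohort.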
146 | 
147 | 
148 | Now that you have your local ancestry calls, you're all set for Tractor! On the [next page](Recover.md), we'll illustrate how you can optionally improve phasing by correcting the switch errors that typically occur during phasing. Alternatively, if your objective primarily involves variant-level tests (e.g., Tractor GWAS), you can proceed directly to [Step 2: Extract tracts](Extract.md).
149 | 
150 | 
151 | ## [Main Page](README.md)
152 | 
153 | ## [Next Page (optional: Recover phasing)](Recover.md)
154 | 
155 | ## [Next Page (Extract tracts)](Extract.md)
156 | 
157 | 


--------------------------------------------------------------------------------
/images/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/images/.DS_Store


--------------------------------------------------------------------------------
/images/AFR.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/images/AFR.png


--------------------------------------------------------------------------------
/images/EUR.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/images/EUR.png


--------------------------------------------------------------------------------
/images/ExtractTract.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/images/ExtractTract.png


--------------------------------------------------------------------------------
/images/Manhattan.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/images/Manhattan.png


--------------------------------------------------------------------------------
/images/SHAPEIT.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/images/SHAPEIT.png


--------------------------------------------------------------------------------
/images/TractorIcon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/images/TractorIcon.png


--------------------------------------------------------------------------------
/images/TractorModel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/images/TractorModel.png


--------------------------------------------------------------------------------
/images/inference.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/images/inference.png


--------------------------------------------------------------------------------
/images/localancestry.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/images/localancestry.png


--------------------------------------------------------------------------------
/test_data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Atkinson-Lab/Tractor-tutorial/4be82c1ae23a5373e13a620608d0a8c5b31b26c4/test_data.zip


--------------------------------------------------------------------------------