|
|
| We used a multi-step computational pipeline for INDEL identification. |
| 1. Vector screening using the NCBI VecScreen system, followed by trimming based on quality score. |
| Traces are trimmed when scores are below Phred 25 for five bases in a row. |
| In addition, the trimmed length must be at least 100 bases with an average score greater than |
| or equal to Phred 25. |
| 2. Repeat-mask traces using RepeatMasker and MaskerAid. |
| 3. Megablast to Golden Path Build 35 sequence with these options: -q 100 -p 95 -F F |
| 4. For each trace that hits the Golden Path, identify an anchor sequence with a minimum of |
| 50 bases and a 100% match to a single location. Based on the anchor sequence, traces are |
| then unmasked and aligned against the mapped locations using Bl2seq (NCBI). INDELs up |
| to 16 bases in length are recorded from this analysis if they are flanked on both sides |
| by five bases with Phred quality scores of 25 or greater. |
| 5. When a mismatch occurs at the beginning or end of a trace, and the number of mismatched |
| bases is greater than or equal to 10, a special computer program (FindMatch) is used to look |
| for matching bases upstream or downstream. If a match is found (at least 95% identity), |
| and the surrounding five bases on both sides have a quality score of 25 or more, an INDEL is recorded. |
| 6. Identified INDELs were mapped, where possible, to the completed Golden Path Build 1 Version 1 |
| chimp sequence to identify the ancestral allele. |
| 7. Double-hit status was determined for each INDEL on the basis of the trace allele matching either |
| the ancestral chimp allele or at least one other trace allele. Identified single-base INDELs |
| without double-hit status were discarded. |
| 8. Traces used with this method were generated by the Baylor and Whitehead Genome Centers |
| as part of an effort to identify SNPs in the human genome. We obtained 8,278,155 of these |
| traces from the Trace DB archive at NCBI. The DNA samples used to generate these traces |
| were obtained from eight humans of African American descent as described for method WI-WGS-200306. |
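
The quality trimming rule in step 1 can be illustrated with a short Python sketch. It is not the pipeline's actual code. The function name trim_trace is hypothetical, and the sketch assumes that a trace is cut at the start of the first run of five consecutive bases scoring below Phred 25; the method text does not say exactly where the cut is made.

```python
from typing import List, Optional

MIN_PHRED = 25    # quality threshold from step 1
RUN_LENGTH = 5    # consecutive low-quality bases that trigger trimming
MIN_LENGTH = 100  # minimum trimmed length from step 1


def trim_trace(quals: List[int]) -> Optional[int]:
    """Return the number of leading bases to keep, or None if the trace fails.

    Assumption: the trace is cut at the start of the first run of
    RUN_LENGTH consecutive bases scoring below MIN_PHRED.
    """
    low_run = 0
    keep = len(quals)
    for i, q in enumerate(quals):
        low_run = low_run + 1 if q < MIN_PHRED else 0
        if low_run == RUN_LENGTH:
            keep = i - RUN_LENGTH + 1  # index where the low-quality run began
            break
    # The trimmed trace must be long enough ...
    if keep < MIN_LENGTH:
        return None
    # ... and have an average score of at least MIN_PHRED.
    if sum(quals[:keep]) / keep < MIN_PHRED:
        return None
    return keep
```

A trace that returns None would be dropped before repeat masking under these assumptions.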
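
The recording criteria in step 4 (INDEL length of at most 16 bases, with five flanking bases of Phred 25 or greater on each side) could be checked roughly as follows. This is a sketch under assumptions: the function name indel_passes and the coordinate convention (the INDEL occupies quals[start:end] in the trace, with start equal to end for a deletion relative to the reference) are illustrative and not taken from the pipeline.

```python
from typing import List

MIN_PHRED = 25  # flank quality threshold from step 4
FLANK = 5       # flanking bases required on each side
MAX_INDEL = 16  # longest INDEL recorded in step 4


def indel_passes(quals: List[int], start: int, end: int, indel_len: int) -> bool:
    """Check the step 4 criteria for one candidate INDEL.

    Assumption: the INDEL occupies quals[start:end] in the trace;
    start == end when the trace lacks bases present in the reference.
    """
    if not 1 <= indel_len <= MAX_INDEL:
        return False
    # Both flanks must lie fully inside the trace.
    if start < FLANK or end + FLANK > len(quals):
        return False
    flanking = quals[start - FLANK:start] + quals[end:end + FLANK]
    return all(q >= MIN_PHRED for q in flanking)
```

The same flank test applies to the INDELs found by FindMatch in step 5, which uses the same five-base, Phred 25 requirement.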
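
The double-hit filter in step 7 can be sketched in the same way. The IndelCall record, the locus key format, and the representation of an allele as a string of inserted or deleted bases are all hypothetical choices made for illustration; only the rule itself (a single-base INDEL is kept only if its allele matches the chimp allele or at least one other trace) comes from the method description.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class IndelCall:
    locus: str     # hypothetical locus key, e.g. "chr1:12345"
    allele: str    # inserted or deleted bases seen in the trace (assumed encoding)
    trace_id: str  # identifier of the supporting trace


def filter_double_hit(calls: List[IndelCall],
                      chimp_alleles: Dict[str, str]) -> List[IndelCall]:
    """Drop single-base INDELs that lack double-hit status."""
    # Count how many traces support each (locus, allele) pair.
    support = Counter((c.locus, c.allele) for c in calls)
    kept = []
    for c in calls:
        double_hit = (support[(c.locus, c.allele)] >= 2
                      or chimp_alleles.get(c.locus) == c.allele)
        if len(c.allele) == 1 and not double_hit:
            continue  # single-base INDEL without double-hit status
        kept.append(c)
    return kept
```

Longer INDELs are kept regardless of double-hit status in this sketch, since the method text only states that single-base INDELs without it were discarded.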