This GRCh38 Genome Specific README.txt file was generated on 20200224 by Jennifer McDaniel

-------------------
GENERAL INFORMATION
-------------------

Title of Dataset: GRCh38 Genome Specific Stratification BED files

Author Information (Jennifer McDaniel, NIST, jennifer.mcdaniel@nist.gov)
Principal Investigator: Justin Zook (jzook@nist.gov)
Dataset Contact(s): Justin Zook (new genome specific stratifications for GRCh37 and GRCh38) and Jennifer McDaniel (remapping)

Date of data collection: 20150101-20200218

----------------------
Stratification Summary
----------------------

These GRCh38 Genome Specific BED files are from the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team and the Genome in a Bottle Consortium. They are intended as standard resource BED files for stratifying true positive, false positive, and false negative variant calls, and were developed to identify regions that are difficult because of potentially difficult variation in each NIST/GIAB sample, including (1) regions containing putative compound heterozygous variants, (2) regions containing multiple variants within 50bp of each other, and (3) regions with potential structural variation and copy number variation. These files can be used with GA4GH benchmarking tools such as hap.py to stratify variant calls.

The initial GRCh37 files were created by Justin Zook at the National Institute of Standards and Technology. The GRCh38 complex/compound/SVs BED files presented here are remapped from the GRCh37 complex/compound/SVs BED files.

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

Licenses/restrictions placed on the data, or limitations of reuse: CC0

Recommended citation for the data:
Krusche, P., Trigg, L., Boutros, P.C. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol 37, 555–560 (2019).
https://doi.org/10.1038/s41587-019-0054-x

Links to other publicly accessible locations of the data:

GIAB FTP URL: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/
- Individual stratification BED files
- stratification README
- .tsv for benchmarking with hap.py
- MD5 checksums

GitHub URL: https://github.com/genome-in-a-bottle/genome-stratifications/
- stratification README
- scripts used to generate and evaluate stratification BED files
- .tsv for benchmarking with hap.py
- MD5 checksums

Data.nist.gov DOI: https://doi.org/10.18434/M32190
- Individual stratification BED files
- stratification README
- scripts used to generate and evaluate stratification BED files
- .tsv for benchmarking with hap.py
- MD5 checksums

--------------------
DATA & FILE OVERVIEW
--------------------

File List:
Following file generation, output files from the scripts were renamed using a common convention. Below is a crosswalk giving each output filename from the file generation scripts, followed by its renamed filename in the v2.0 release.

HG001 (NA12878)
GRCh38_remapped_NA12878_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_ALLCHROM_v3.2.2_highconf_geno2haplo_compoundhet_slop50.bed.gz
  GRCh38_HG001_GIABv3.2.2_compoundhet_slop50.bed.gz
GRCh38_remapped_NA12878_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_ALLCHROM_v3.2.2_highconf_varswithin50bp.bed.gz
  GRCh38_HG001_GIABv3.2.2_varswithin50bp.bed.gz
GRCh38_remapped_HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_RTGandPGphasetransfer_comphetindel10bp_slop50.bed.gz
  GRCh38_HG001_GIABv3.3.2_comphetindel10bp_slop50.bed.gz
GRCh38_remapped_HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_RTGandPGphasetransfer_comphetsnp10bp_slop50.bed.gz
  GRCh38_HG001_GIABv3.3.2_comphetsnp10bp_slop50.bed.gz
GRCh38_remapped_HG001_genomespecific_complexandSVs_v3.3.2_PG_RTG.bed.gz
  GRCh38_HG001_GIABv3.3.2_complexandSVs.bed.gz
GRCh38_remapped_HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_RTGandPGphasetransfer_complexindel10bp_slop50.bed.gz
  GRCh38_HG001_GIABv3.3.2_complexindel10bp_slop50.bed.gz
HG001_genomespecific_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
  GRCh38_HG001_GIABv3.3.2_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
notinHG001_genomespecific_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
  GRCh38_HG001_GIABv3.3.2_RTG_PG_v3.3.2_SVs_notin_alldifficultregions.bed.gz
GRCh38_remapped_HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_RTGandPGphasetransfer_snpswithin10bp_slop50.bed.gz
  GRCh38_HG001_GIABv3.3.2_snpswithin10bp_slop50.bed.gz
GRCh38_remapped_PacBio_MetaSV_mergedSVs.bed.gz
  GRCh38_HG001_PacBio_MetaSV.bed.gz
GRCh38_remapped_PG2016-1.0_NA12878_b37_comphetindel10bp_slop50.bed.gz
  GRCh38_HG001_PG2016-1.0_comphetindel10bp_slop50.bed.gz
GRCh38_remapped_PG2016-1.0_NA12878_b37_comphetsnp10bp_slop50.bed.gz
  GRCh38_HG001_PG2016-1.0_comphetsnp10bp_slop50.bed.gz
GRCh38_remapped_PG2016-1.0_NA12878_b37_complexindel10bp_slop50.bed.gz
  GRCh38_HG001_PG2016-1.0_complexindel10bp_slop50.bed.gz
GRCh38_remapped_PG2016-1.0_NA12878_b37_snpswithin10bp_slop50.bed.gz
  GRCh38_HG001_PG2016-1.0_snpswithin10bp_slop50.bed.gz
GRCh38_remapped_sp_v37.7.3.NA12878_comphetindel10bp_slop50.bed.gz
  GRCh38_HG001_RTG_37.7.3_comphetindel10bp_slop50.bed.gz
GRCh38_remapped_sp_v37.7.3.NA12878_comphetsnp10bp_slop50.bed.gz
  GRCh38_HG001_RTG_37.7.3_comphetsnp10bp_slop50.bed.gz
GRCh38_remapped_sp_v37.7.3.NA12878_complexindel10bp_slop50.bed.gz
  GRCh38_HG001_RTG_37.7.3_complexindel10bp_slop50.bed.gz
GRCh38_remapped_sp_v37.7.3.NA12878_snpswithin10bp_slop50.bed.gz
  GRCh38_HG001_RTG_37.7.3_snpswithin10bp_slop50.bed.gz

HG002 (AJ son)
GRCh38_union_HG002_CCS_15kb_20kb_merged_ONT_1000_window_size_combined_elliptical_outlier_threshold.bed.gz
  GRCh38_HG002_GIABv4.1_CNV_CCSandONT_elliptical_outlier.bed.gz
GRCh38_mrcanavar_intersect_ccs_1000_window_size_cnv_threshold_intersect_ont_1000_window_size_cnv_threshold.bed.gz
  GRCh38_HG002_GIABv4.1_CNV_mrcanavarIllumina_CCShighcov_ONThighcov_intersection.bed.gz
expanded_150_GRCh38_remapped_HG002_SVs_Tier1plusTier2_v0.6.1.bed.gz
  GRCh38_HG002_expanded_150__Tier1plusTier2_v0.6.1.bed.gz
GRCh38_remapped_HG002_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_CHROM1-22_v3.2.2_highconf_geno2haplo_compoundhet_slop50.bed.gz
  GRCh38_HG002_GIABv3.2.2_compoundhet_slop50.bed.gz
GRCh38_remapped_HG002_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_CHROM1-22_v3.2.2_highconf_varswithin50bp.bed.gz
  GRCh38_HG002_GIABv3.2.2_varswithin50bp.bed.gz
GRCh38_remapped_HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased_comphetindel10bp_slop50.bed.gz
  GRCh38_HG002_GIABv3.3.2_comphetindel10bp_slop50.bed.gz
GRCh38_remapped_HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased_comphetsnp10bp_slop50.bed.gz
  GRCh38_HG002_GIABv3.3.2_comphetsnp10bp_slop50.bed.gz
GRCh38_remapped_HG002_genomespecific_complexandSVs_v3.3.2.bed.gz
  GRCh38_HG002_GIABv3.3.2_complexandSVs.bed.gz
HG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
  GRCh38_HG002_GIABv3.3.2_complexandSVs_alldifficultregions.bed.gz
GRCh38_remapped_HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased_complexindel10bp_slop50.bed.gz
  GRCh38_HG002_GIABv3.3.2_complexindel10bp_slop50.bed.gz
notinHG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
  GRCh38_HG002_GIABv3.3.2_notin_complexandSVs_alldifficultregions.bed.gz
GRCh38_remapped_HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased_snpswithin10bp_slop50.bed.gz
  GRCh38_HG002_GIABv3.3.2_snpswithin10bp_slop50.bed.gz
GRCh38_HG002_genomespecific_CNVsandSVs_v4.1.bed.gz
  GRCh38_HG002_GIABv4.1_CNVsandSVs.bed.gz
HG002_GRCh38_1_22_v4.1_draft_benchmark_comphetindel10bp_slop50.bed.gz
  GRCh38_HG002_GIABv4.1_comphetindel10bp_slop50.bed.gz
HG002_GRCh38_1_22_v4.1_draft_benchmark_comphetsnp10bp_slop50.bed.gz
  GRCh38_HG002_GIABv4.1_comphetsnp10bp_slop50.bed.gz
HG002_genomespecific_complexandSVs_v4.1.bed.gz
  GRCh38_HG002_GIABv4.1_complexandSVs.bed.gz
HG002_genomespecific_complexandSVs_alldifficultregions_v4.1.bed.gz
  GRCh38_HG002_GIABv4.1_complexandSVs_alldifficultregions.bed.gz
HG002_GRCh38_1_22_v4.1_draft_benchmark_complexindel10bp_slop50.bed.gz
  GRCh38_HG002_GIABv4.1_complexindel10bp_slop50.bed.gz
notinHG002_genomespecific_complexandSVs_alldifficultregions_v4.1.bed.gz
  GRCh38_HG002_GIABv4.1_notin_complexandSVs_alldifficultregions.bed.gz
HG002_GRCh38_1_22_v4.1_draft_benchmark_othercomplexwithin10bp_slop50.bed.gz
  GRCh38_HG002_GIABv4.1_othercomplexwithin10bp_slop50.bed.gz
HG002_GRCh38_1_22_v4.1_draft_benchmark_snpswithin10bp_slop50.bed.gz
  GRCh38_HG002_GIABv4.1_snpswithin10bp_slop50.bed.gz
GRCh38_remapped_HG002_HG003_HG004_allsvs_merged.bed.gz
  GRCh38_HG002_HG003_HG004_allsvs.bed.gz
HG2_SKor_TrioONTCanu_intersect_HG2_SKor_TrioONTFlye_intersect_HG2_SKor_CCS15_gt10kb_GRCh38.bed.gz
  GRCh38_HG002_GIABv4.1_CNV_gt2assemblycontigs_ONTCanu_ONTFlye_CCSCanu.bed.gz
SVMergeInversions.GRCh38.120519.clustered_slop150_chr1_22.bed.gz
  GRCh38_HG002_GIABv4.1_inversions_slop25percent.bed.gz
GRCh38_remapped_HG002_SVs_Tier1plusTier2_v0.6.1.bed.gz
  GRCh38_HG002_Tier1plusTier2_v0.6.1.bed.gz

HG003 (AJ Father)
GRCh38_remapped_HG003_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_comphetindel10bp_slop50.bed.gz
  GRCh38_HG003_GIABv3.3.2_comphetindel10bp_slop50.bed.gz
GRCh38_remapped_HG003_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_comphetsnp10bp_slop50.bed.gz
  GRCh38_HG003_GIABv3.3.2_comphetsnp10bp_slop50.bed.gz
GRCh38_remapped_HG003_genomespecific_complexandSVs_v3.3.2.bed.gz
  GRCh38_HG003_GIABv3.3.2_complexandSVs.bed.gz
HG003_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
  GRCh38_HG003_GIABv3.3.2_complexandSVs_alldifficultregions.bed.gz
GRCh38_remapped_HG003_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_complexindel10bp_slop50.bed.gz
  GRCh38_HG003_GIABv3.3.2_complexindel10bp_slop50.bed.gz
notinHG003_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
  GRCh38_HG003_GIABv3.3.2_notin_complexandSVs_alldifficultregions.bed.gz
GRCh38_remapped_HG003_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_snpswithin10bp_slop50.bed.gz
  GRCh38_HG003_GIABv3.3.2_snpswithin10bp_slop50.bed.gz

HG004 (AJ Mother)
GRCh38_remapped_HG004_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_comphetindel10bp_slop50.bed.gz
  GRCh38_HG004_GIABv3.3.2_comphetindel10bp_slop50.bed.gz
GRCh38_remapped_HG004_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_comphetsnp10bp_slop50.bed.gz
  GRCh38_HG004_GIABv3.3.2_comphetsnp10bp_slop50.bed.gz
GRCh38_remapped_HG004_genomespecific_complexandSVs_v3.3.2.bed.gz
  GRCh38_HG004_GIABv3.3.2_complexandSVs.bed.gz
HG004_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
  GRCh38_HG004_GIABv3.3.2_complexandSVs_alldifficultregions.bed.gz
GRCh38_remapped_HG004_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_complexindel10bp_slop50.bed.gz
  GRCh38_HG004_GIABv3.3.2_complexindel10bp_slop50.bed.gz
notinHG004_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
  GRCh38_HG004_GIABv3.3.2_notin_complexandSVs_alldifficultregions.bed.gz
GRCh38_remapped_HG004_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_snpswithin10bp_slop50.bed.gz
  GRCh38_HG004_GIABv3.3.2_snpswithin10bp_slop50.bed.gz

HG005 (Chinese Son)
GRCh38_remapped_HG005_GRCh37_highconf_CG-IllFB-IllGATKHC-Ion-SOLID_CHROM1-22_v.3.3.2_highconf_comphetindel10bp_slop50.bed.gz
  GRCh38_HG005_GIABv3.3.2_comphetindel10bp_slop50.bed.gz
GRCh38_remapped_HG005_GRCh37_highconf_CG-IllFB-IllGATKHC-Ion-SOLID_CHROM1-22_v.3.3.2_highconf_comphetsnp10bp_slop50.bed.gz
  GRCh38_HG005_GIABv3.3.2_comphetsnp10bp_slop50.bed.gz
GRCh38_remapped_HG005_genomespecific_complexandSVs_v3.3.2.bed.gz
  GRCh38_HG005_GIABv3.3.2_complexandSVs.bed.gz
HG005_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
  GRCh38_HG005_GIABv3.3.2_complexandSVs_alldifficultregions.bed.gz
GRCh38_remapped_HG005_GRCh37_highconf_CG-IllFB-IllGATKHC-Ion-SOLID_CHROM1-22_v.3.3.2_highconf_complexindel10bp_slop50.bed.gz
  GRCh38_HG005_GIABv3.3.2_complexindel10bp_slop50.bed.gz
notinHG005_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
  GRCh38_HG005_GIABv3.3.2_notin_complexandSVs_alldifficultregions.bed.gz
GRCh38_remapped_HG005_GRCh37_highconf_CG-IllFB-IllGATKHC-Ion-SOLID_CHROM1-22_v.3.3.2_highconf_snpswithin10bp_slop50.bed.gz
  GRCh38_HG005_GIABv3.3.2_snpswithin10bp_slop50.bed.gz
GRCh38_remapped_HG005_HG006_HG007_FB_GATKHC_CG_MetaSV_allsvs_merged.bed.gz
  GRCh38_HG005_HG006_HG007_MetaSV_allsvs.bed.gz

File descriptions:
- *comphetindel10bp*.bed: Regions containing at least one variant on each haplotype within 10bp of each other, where at least one of the variants is an indel, with 50bp slop added on each side
- *comphetsnp10bp*.bed: Regions containing at least one variant on each haplotype within 10bp of each other, where all variants are SNPs, with 50bp slop added on each side
- *complexindel10bp*.bed: Regions containing at least two variants on one haplotype within 10bp of each other, where at least one of the variants is an indel, with 50bp slop added on each side
- *snpswithin10bp*.bed: Regions containing at least two variants on one haplotype within 10bp of each other, where all variants are SNPs, with 50bp slop added on each side
- *othercomplexwithin10bp*.bed: Any other regions containing at least two variants within 10bp of each other, with 50bp slop added on each side
- *complexandSVs.bed: Union of all of the above bed files and the SV and CNV bed files used for each version of the benchmark for each sample
- *complexandSVs_alldifficultregions.bed: Union of GRCh3x_alldifficultregions.bed.gz and complexandSVs.bed for each version of the benchmark for each sample
- *notin_complexandSVs_alldifficultregions.bed: Complement of *complexandSVs_alldifficultregions.bed for each version of the benchmark for each sample
- *HG002_Tier1plusTier2_v0.6.1.bed.gz: Regions containing GIAB v0.6 Tier1 or Tier2 insertions or deletions >=50bp, expanded by 25% on each side
- *inversions*: Putative inversions detected in PacBio and ONT assemblies, expanded by 25% on each side, including regions of breakpoint homology
- *CNV_CCSandONT_elliptical_outlier*: Potential duplications relative to the reference, detected as higher than normal coverage in PacBio CCS and/or ONT
- *CNV_mrcanavarIllumina_CCShighcov_ONThighcov_intersection*: Potential duplications relative to the reference, detected as higher than normal coverage in PacBio CCS and/or ONT and as segmental duplications by mrcanavar from Illumina
- *CNV_gt2assemblycontigs_ONTCanu_ONTFlye_CCSCanu*: Potential duplications relative to the reference, detected as more than 2 contigs aligning in 3 ONT and CCS Trio-binned assemblies
- *HG002_GIABv4.1_CNVsandSVs.bed.gz: Union of the above putative CNV and SV bed files used to exclude regions from the v4.1 GIAB benchmark

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used to generate files (Files generated by Justin Zook):
vcflib vcfgeno2haplo and unix commands were used to identify complex and compound variants in benchmark vcf files from GIAB (https://rdcu.be/bue67) for all samples, as well as from Platinum Genomes (http://dx.doi.org/10.1101/gr.210500.116) and Real Time Genomics (http://doi.org/10.1089/cmb.2014.0029) for HG001/NA12878. The GRCh37 Genome Specific complex/compound/SVs variant BED files were remapped to GRCh38 using the NCBI Remapping Service.

Methods for stratification file generation:
Most complex/compound/SVs variant beds were generated by genome_specific.sh, except for the union beds, which were generated by Merge_difficult_regions_GRCh38.sh. The CNV and SV beds were generated in GRCh38_Generating_v4.1_excluded_beds.ipynb.

Remapping Methods (Remapping performed by Jennifer McDaniel):
Remap Genome Specific complex/compound/SVs variant BEDs GRCh37 --> GRCh38 (performed June 2019).
Remapping of the GRCh37 Genome Specific BED files was performed with the NCBI Remapping Service (https://www.ncbi.nlm.nih.gov/genome/tools/remap). This service projects BED coordinates from GRCh37 to GRCh38. Prior to remapping, the GRCh37 BED files were trimmed to three fields (chromosome number, start position, and stop position) using the script trim_bed_to_three_fields.sh.
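
The trimming step described above could be sketched roughly as follows. This is a minimal Python illustration, not the contents of trim_bed_to_three_fields.sh (that script is not reproduced in this README). The function name, the header-skipping rule, and the sample records are assumptions made for the example.

```python
import io


def trim_bed_to_three_fields(lines):
    """Yield 'chrom<TAB>start<TAB>end' for every data line, dropping extra columns."""
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and '#', 'track', or 'browser' lines are not data.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 fields, got: {line!r}")
        yield "\t".join(fields[:3])


# Made-up input with extra name/score/strand columns.
example = io.StringIO("1\t100\t200\tregionA\t0\t+\n1\t500\t650\tregionB\n")
trimmed = list(trim_bed_to_three_fields(example))
print("\n".join(trimmed))
```

Keeping only chromosome, start, and end means the remapped output has the same three-field layout regardless of which optional columns the source BED carried.
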
The trimmed BED files were input for remap using the following parameters:
Source = GRCh37.p13::primary assembly (2/27/09)
Target = GRCh38.p11::primary assembly (12/20/13)
Default remapping options were used:
Minimum ratio of bases that must be remapped = 0.5
Maximum ratio for difference between source length and target length = 2.0
Allow multiple locations to be returned = yes
Merge Fragments = yes

Post Processing of all files:
Stratification BEDs were post processed to remove reference Ns, specifically gaps and pseudoautosomal Y regions. The BEDs are merged and sorted and only contain chromosomes 1-22, X and Y. Each is a 3-field BED file (chromosome, start, end) with a header. Post processing scripts can be found in "Prepare Stratification BEDs for GitHub.ipynb".

Dependencies:
GRCh37 Genome Specific BEDs

Software- or Instrument-specific information needed to interpret the data, including software and hardware version numbers:
NCBI Remapping Service (https://www.ncbi.nlm.nih.gov/genome/tools/remap)

Describe any quality-assurance procedures performed on the data:
Coverage of the GRCh37 and GRCh38 BED files was compared for each chromosome using R. We confirmed that coverage in the GRCh38 BEDs was as expected relative to the GRCh37 BEDs.

--------------------------
DATA-SPECIFIC INFORMATION
--------------------------

File structure:
All stratification files are standard 3-field BED files (chromosome, start, end) with a header.

File naming convention:
GRCh38_Sample_BenchmarkVersion_StratificationType_slopadded.bed.gz
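
To illustrate the file structure and the kind of per-chromosome coverage tally used in the quality-assurance step, here is a small Python sketch. It stands in for the R analysis and is not the authors' code. It assumes the header is any line whose second and third fields are not integers, and that intervals are already merged, as stated under post processing. The demo file name and its contents are invented.

```python
import gzip
import os
import tempfile
from collections import defaultdict


def covered_bases_per_chrom(path):
    """Sum interval lengths (end - start) per chromosome in a 3-field BED file."""
    totals = defaultdict(int)
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue  # blank or malformed line
            try:
                start, end = int(fields[1]), int(fields[2])
            except ValueError:
                continue  # assumption: non-numeric coordinates mark a header line
            # BED intervals are 0-based and half-open, so the length is end - start.
            totals[fields[0]] += end - start
    return dict(totals)


# Demo on a tiny invented file; real stratification files would be passed by path.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "GRCh38_demo.bed.gz")
    with gzip.open(demo, "wt") as out:
        out.write("chrom\tstart\tend\nchr1\t100\t200\nchr1\t500\t650\nchr2\t0\t10\n")
    coverage = covered_bases_per_chrom(demo)
print(coverage)
```

Running the same function on a GRCh37 file and its GRCh38 counterpart gives two per-chromosome totals that can be compared side by side. If a file contained overlapping intervals, the totals would overcount, so this relies on the merged state described above.
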