Included here are genes and non-coding RNAs from the Ensembl 54 database and lincRNA
genes from the Ensembl 59 database. The files contain different kinds of genes,
described below. In all cases, only clusters overlapping exons are given (there are no
clusters in intergenic, intronic, repeat or promoter regions). Of these, only those
clusters that overlap a predicted high pair probability region are given.

##############
# EXON FILES #
##############

miRNA.xls
---------
Contains the information of clusters overlapping miRNA genes

snoRNA.xls
----------
Contains the information of clusters overlapping snoRNA genes and snoRNA pseudogenes

protein_coding_genes.xls
------------------------
Contains the information of clusters overlapping protein coding genes

IG_V_gene.xls
-------------
Contains the information of clusters overlapping exons from IG_V genes

all_other_pseudogene.xls
------------------------
Contains the information of clusters overlapping processed pseudogenes, pseudogenes,
retrotransposed pseudogenes and unprocessed pseudogenes

misc_RNA.xls
------------
Contains the information of clusters overlapping

rRNA.xls
--------
Contains the information of clusters overlapping rRNA genes, rRNA pseudogenes and
mt_rRNA genes

scRNA.xls
---------
Contains the information of clusters overlapping scRNA genes and scRNA pseudogenes

snRNA.xls
---------
Contains the information of clusters overlapping snRNA genes and snRNA pseudogenes

tRNA.xls
--------
Contains the information of clusters overlapping tRNAs, tRNA pseudogenes, mt_tRNA and
mt_tRNA pseudogenes

LincRNA.xls
-----------
Contains the information of clusters overlapping lincRNAs from Ensembl 59 mapped to hg18

##################################
# INFORMATION FOUND IN EACH FILE #
##################################

In each file, the candidate exons are sorted by the number of reads in the significant
cluster, the number of datasets in which we find a significant cluster overlapping the exon, the
score of the stem, the energy of the secondary structure and the average pair
probability of the structure. Each exon can have one or more lines in the Excel file,
each corresponding to a different cluster.

Each file contains 38 fields:

NUMBER_OF_DATASETS: number of different datasets in which we found significant clusters
    for an exon. The value ranges from 1 (only one dataset has evidence for the exon)
    to 4 (all the datasets have evidence).
DATASET: name of the dataset containing the significant cluster of reads. Values:
    --> D8.1: CLIP against endogenous protein (1st CLIP)
    --> T7.1: CLIP against T7-flagged protein (1st CLIP)
    --> D8.2: CLIP against endogenous protein (2nd CLIP)
    --> T7.2: CLIP against T7-flagged protein (2nd CLIP)
EXON_ID: exon id
GENE_ID: Ensembl 54 gene id
GENE_NAME: gene name
BIOTYPE: type of molecule analyzed (tRNA, snoRNA, protein_coding, etc.)
CHR
START: exon start
END: exon end
STRAND
TRANSCRIPTS: ids of the transcripts containing the exon
EXON_TYPE: type of exon. Values: INTERNAL, TERMINAL, FIRST, SINGLE_EXON_GENE, etc.
ALTERNATIVE?: information about alternative splicing. Values: CONSTITUTIVE, CASSETTE
    or NO INFO
    --> CONSTITUTIVE: at least 10 ESTs overlap the region and the exon is included in
        90% or more of them (only for INTERNAL exons)
    --> CASSETTE: at least 10 ESTs overlap the region and the exon is included in less
        than 90% of them (only for INTERNAL exons)
    --> NO INFO: the exon has fewer than 10 ESTs overlapping the region or is not an
        INTERNAL exon
SKIPPING_EST: number of ESTs skipping the exon (only for INTERNAL exons. Otherwise
    there is a "-")
Including_EST: number of ESTs verifying the exon (only for INTERNAL exons. Otherwise
    there is a "-")
MOUSE_HOMOLOG: id of the mouse homolog exon, if it exists. Otherwise there is a "-".
MOUSE_ALTERNATIVE?: information about alternative splicing. Values: CONSTITUTIVE,
    CASSETTE or NO INFO
MOUSE_SKIPPING_EST: number of ESTs skipping the exon (only for INTERNAL exons.
Otherwise there is a "-")
MOUSE_INCLUDING_EST: number of ESTs verifying the exon (only for INTERNAL exons.
    Otherwise there is a "-")
SOURCE: information about the origin of the data. Values:
    --> genomic: the significant cluster was found in the analysis of the reads mapped
        to the genome
    --> transcript: the significant cluster was found in the analysis of the reads
        mapped to the transcriptome
CLUSTER_START: position of the cluster start relative to the start of the exon. A
    negative value means that the cluster starts before the exon start
CLUSTER_END: position of the cluster end relative to the start of the exon
CLUSTER_OFFSET: position of the cluster end relative to the end of the exon. A negative
    value means that the cluster ends after the exon end
RANGE: position of the cluster. Values:
    --> pre-mRNA: the cluster overlaps an exon and the flanking intron(s)
    --> mRNA: the cluster overlaps an exon and the previous/next exon(s) in the mRNA
    --> exonic: the cluster is included in the exon
CLUSTER_GENOMIC_START: genomic position of the cluster start, if possible (only exonic
    and pre-mRNA)
CLUSTER_GENOMIC_END: genomic position of the cluster end, if possible (only exonic and
    pre-mRNA)
NUMBER_OF_READS: number of reads in the cluster
STRUCTURE_START: position of the structure start relative to the start of the exon. A
    negative value means that the structure starts before the exon start
STRUCTURE_END: position of the structure end relative to the start of the exon
STRUCTURE_OFFSET: position of the structure end relative to the end of the exon. A
    negative value means that the structure ends after the exon end
GENOMIC_STRUCTURE_START: genomic position of the structure start, if possible (only
    exonic and pre-mRNA)
GENOMIC_STRUCTURE_END: genomic position of the structure end, if possible (only exonic
    and pre-mRNA)
STEM_LENGTH: length of the stem that has the highest number of nt overlapping with the
    read cluster
STEM_SCORE: score of the stem.
    It is the number of nt of the stem in a base pair minus the number of nt in a base
    pair from "internal" stems
SEQUENCE: sequence of the predicted structure
STRUCTURE: predicted structure
OPTIMAL_ENERGY: energy of the predicted structure
AVERAGE_PP: average pair probability of the predicted structure
PP_PER_POSITION: pair probability of each of the nucleotides in the structure,
    separated by ":"

###############
# OTHER FILES #
###############

In each case, only those clusters that overlap a predicted high pair probability region
are given.

intron.xls
----------
Contains the information of clusters overlapping introns from coding genes (and not
overlapping any other type of element, e.g. exons, snoRNA, miRNA, etc.) included in the
Ensembl annotation

intergenic.xls
--------------
Contains the information of clusters in intergenic regions that overlap neither any
annotated element included in Ensembl nor repeats from RepeatMasker

promoter.xls
------------
Contains the information of clusters in promoter regions of Ensembl genes. In this
case, the promoters are defined as the 1000 nt upstream of the start site plus the
200 nt downstream of the start site

Each file contains the following fields (not all fields are present in all files):

NUMBER_OF_DATASETS: number of different datasets in which we found significant clusters
    for an exon. The value ranges from 1 (only one dataset has evidence for the exon)
    to 4 (all the datasets have evidence).
DATASET: name of the dataset containing the significant cluster of reads.
    Values:
    --> D8.1: CLIP against endogenous protein (1st CLIP)
    --> T7.1: CLIP against T7-flagged protein (1st CLIP)
    --> D8.2: CLIP against endogenous protein (2nd CLIP)
    --> T7.2: CLIP against T7-flagged protein (2nd CLIP)
GENE_ID (only intron and promoter files): Ensembl 54 gene id
GENE_NAME (only intron and promoter files): gene name
GENE_START (only promoter file): start of the gene
GENE_END (only promoter file): end of the gene
POSITION (only promoter file): position of the region relative to the gene start. The
    format is "cluster_start..cluster_end". If the coordinates are negative, the
    cluster is located downstream of the gene start. If the coordinates are positive,
    the cluster is located upstream of the gene start.
    e.g. -127..-158 => the cluster starts 127 nt downstream of the gene start and ends
    158 nt downstream of the gene start
CHR
START: minimum start. If there are several overlapping clusters, this is the start
    position of the left-most cluster
END: maximum end. If there are several overlapping clusters, this is the end position
    of the right-most cluster
CLUSTERS: summary of the real clusters in this region. The cluster ids are separated by "::".
    Each of the ids contains the following information:
        chr_cluster-start_cluster-end_strand|dataset
    e.g.:
        chr_cluster-start_cluster-end_strand|dataset::chr_cluster-start_cluster-end_strand|dataset
    If the same cluster belongs to more than one dataset, this is also shown in the id
    by adding the dataset after the previous dataset, e.g.:
        chr_cluster-start_cluster-end_strand|dataset1:dataset2
ENDOGENOUS_READS: number of reads in the endogenous dataset
T7_READS: number of reads in the T7 dataset
TOTAL_READS: total number of reads overlapping the cluster (from the endogenous and T7
    datasets)
SCORE: score of the region in which we do the structure prediction (average pair
    probability of the region). It is used to identify miRNA candidates
miRNA (only in intergenic and intron files): indicates whether the score of the region
    is over the miRNA cut-off ("YES") or not ("NO")
STRUCTURE_START: genomic position of the structure start, if possible (only exonic and
    pre-mRNA)
STRUCTURE_END: genomic position of the structure end, if possible (only exonic and
    pre-mRNA)
STEM_LENGTH: length of the stem that has the highest number of nt overlapping with the
    read cluster
STEM_SCORE: score of the stem. It is the number of nt of the stem in a base pair minus
    the number of nt in a base pair from "internal" stems
SEQUENCE: sequence of the predicted structure
STRUCTURE: predicted structure
OPTIMAL_ENERGY: energy of the predicted structure
AVERAGE_PP: average pair probability of the predicted structure
PP_PER_POSITION: pair probability of each of the nucleotides in the structure,
    separated by ":"

######################################################
# ANNOTATION DETAILS OF NON-CODING RNAs FROM ENSEMBL #
######################################################

Most ncRNAs are annotated by aligning genomic sequence against RFAM using BLASTN. The
BLAST hits are clustered and filtered by E value and are used to seed Infernal searches
of the locus with the corresponding RFAM covariance models.
The purpose of this is to reduce the search space required, as scanning the entire
genome with all the RFAM covariance models would be extremely CPU-intensive. The
resulting BLAST hits are then used as supporting evidence for ncRNA genes.

miRNAs are predicted by BLASTN of genomic sequence slices against miRBase sequences.
All species are used. The BLAST hits are clustered and filtered by E value, and the
aligned genomic sequence is then checked for possible secondary structure using
RNAfold. If evidence is found that the genomic sequence could form a stable hairpin
structure, the locus is used to create a miRNA gene model. The resulting BLAST hit is
used as supporting evidence for the miRNA gene.
Note: The miRNA identifier and name are only associated with the resulting Ensembl
miRNA if they are of the same species.

tRNAs are annotated as part of the raw compute process using tRNAscan-SE.

lincRNAs (long intergenic non-coding RNAs)
Chromatin-state map data and gene annotation, along with cDNA and next-generation
sequencing data, are used to predict lincRNAs for human and mouse. Regions of chromatin
methylation (H3K4me3 and H3K36me3) outside of known protein-coding loci are identified
[1]. A comparative approach identifies conserved regions using alignments determined by
the Ensembl Compara project, following a method similar to that of C. Ponting [2]. A
validation step investigates whether cDNA or next-generation sequencing data overlap
the identified conserved chromatin methylation regions. Pfam/TIGRFAM is used to test
whether ORFs of the identified ncRNAs contain protein domains.

#######################
# CUSTOM TRACKS LINKS #
#######################

For the datasets from the second CLIP experiment (D8.2 and T7.2) we have created custom
tracks in BedGraph format:

-> BedGraph significant clusters (mFDR p-value < 0.01): Clusters of reads from a given
   dataset mapped to the genome that are significant (p-value < 0.01) after applying a
   modified False Discovery Rate test (mFDR).
   The blocks correspond to clusters with a p-value smaller than 0.01.

-> BedGraph all clusters: Contains ALL clusters of reads of a dataset mapped to the
   genome.
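
Several fields described in the sections above pack more than one value into a single
cell (CLUSTERS, PP_PER_POSITION and POSITION). The Python sketch below shows one
possible way to unpack them once a cell has been read as a string. It is not part of
the original analysis. The function names and the example values are hypothetical, and
the sketch assumes that the parts of a cluster id are separated by "_" exactly as in
the template chr_cluster-start_cluster-end_strand|dataset, that coordinates are
integers, and that pair probabilities are plain decimal numbers.

```python
from typing import List, NamedTuple, Tuple


class Cluster(NamedTuple):
    """One cluster id taken from a CLUSTERS cell (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    strand: str
    datasets: Tuple[str, ...]


def parse_clusters(field: str) -> List[Cluster]:
    """Split a CLUSTERS cell into its cluster ids (ids are separated by '::')."""
    clusters = []
    for cluster_id in field.split("::"):
        coords, _, dataset_part = cluster_id.partition("|")
        # Split from the right so that chromosome names that themselves
        # contain "_" are kept in one piece (assumption about the id layout).
        chrom, start, end, strand = coords.rsplit("_", 3)
        # Several datasets for the same cluster are joined with ":".
        datasets = tuple(dataset_part.split(":")) if dataset_part else ()
        clusters.append(Cluster(chrom, int(start), int(end), strand, datasets))
    return clusters


def parse_pp_per_position(field: str) -> List[float]:
    """Turn a PP_PER_POSITION cell (':'-separated) into a list of floats."""
    return [float(value) for value in field.split(":")]


def parse_position(field: str) -> Tuple[int, int]:
    """Turn a POSITION cell ('cluster_start..cluster_end') into two integers."""
    start, end = field.split("..")
    return int(start), int(end)


if __name__ == "__main__":
    # Made-up example values, not taken from the real files.
    print(parse_clusters("chr1_1000_1040_+|D8.2:T7.2::chr1_1100_1130_+|D8.1"))
    print(parse_pp_per_position("0.91:0.87:0.12"))
    print(parse_position("-127..-158"))
```

The sign convention of POSITION is left untouched by parse_position; it only returns
the two numbers, and their interpretation follows the field description above.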