Computational Pipeline for analysis of known and prediction of novel miRNAs from the deep sequencing data
This is a companion website to the paper "Analysis of microRNA transcriptome by deep sequencing of small RNA libraries of peripheral blood"
by: Candida Vaz, Hafiz M Ahmed, Pratibha Sharma, Rashi Gupta, Lalit Kumar, Ritu Kulshreshtha and Alok Bhattacharya
For each sample, a Sequence file and a Tag file are provided. The Tag file comprises the unique sequences with their corresponding frequencies. Tag files are generated post-alignment as a summary of the Sequence files, and every Sequence file has a corresponding Tag file. They give the researcher an indication of the most common to the rarest sequences in the dataset. The numerical frequency of each sequence in the Sequence file indicates the relative expression of the corresponding transcript. The unique sequences in the Tag files contain a 3' adaptor sequence (TCGTATGCCGTCTTCTGCTTG). The amount of 3' adaptor present is variable and depends on the length of the sRNA. Each delivered sequence is 33 or 35 bases long, so part of the adaptor is seen in a sequence whenever the sRNA is shorter than that. The adaptor, or whichever segment of it is present, needs to be trimmed for proper alignment to the transcriptome/genome.
(A) Preparation / Processing of the datasets:
i). Removal of the adaptor sequences:
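As an illustration of the trimming step, here is a minimal Python sketch. It is not the script distributed with this pipeline. It assumes that a read either contains the whole 3' adaptor or ends in a leading part of it, and the function name `trim_adaptor` and the `min_overlap` threshold are hypothetical choices made for the example.

```python
ADAPTOR = "TCGTATGCCGTCTTCTGCTTG"  # 3' adaptor quoted in the text above


def trim_adaptor(read, adaptor=ADAPTOR, min_overlap=5):
    """Return the read with the 3' adaptor, or a leading part of it, removed.

    min_overlap is an assumed threshold: shorter adaptor fragments at the
    end of a read are left in place because they could match by chance.
    """
    for start in range(len(read) - min_overlap + 1):
        tail = read[start:]
        # Case 1: the whole adaptor starts here (tail begins with it).
        # Case 2: the read ends inside the adaptor (tail is an adaptor prefix).
        if tail.startswith(adaptor) or adaptor.startswith(tail):
            return read[:start]
    return read  # no adaptor found, read returned unchanged


if __name__ == "__main__":
    srna = "TGAGGTAGTAGGTTGTATAGTT"          # example 22-nt sRNA
    read = srna + ADAPTOR[:11]               # 33-base read as delivered
    print(trim_adaptor(read))
```

A larger `min_overlap` trims less aggressively; a smaller one removes short adaptor remnants at the risk of clipping genuine sRNA bases.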
ii). Clustering and removal of redundancy after removal of the adaptor sequence:
Although the Tag file contains unique sequences, some of them become identical once the adaptor has been removed. Each such sequence is represented once and the frequencies of its copies are summed.
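The collapsing step can be sketched in a few lines of Python. This is only an illustration under the assumption that the trimmed tags are available as (sequence, frequency) pairs; the function name `collapse_tags` is hypothetical and is not part of the distributed programs.

```python
from collections import Counter


def collapse_tags(trimmed_tags):
    """Merge identical trimmed sequences and sum their frequencies.

    trimmed_tags: iterable of (sequence, frequency) pairs after trimming.
    Returns a dict mapping each distinct sequence to its total frequency.
    """
    totals = Counter()
    for seq, freq in trimmed_tags:
        if seq:  # skip entries that consisted of adaptor only
            totals[seq] += freq
    return dict(totals)


if __name__ == "__main__":
    tags = [("ACGT", 3), ("ACGT", 2), ("TTGA", 1)]
    print(collapse_tags(tags))
```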
iii). Conversion into fasta format:
The final trimmed file is written in fasta format, where the header (sequence ID) of each entry retains the length and frequency of the sequence. The sequence ID comprises a running number along with the length and frequency of that sequence, which ensures that each trimmed sequence has a unique ID by which it is referred to.
Note: If the trimmed file is already provided, this step can be skipped.
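A possible shape for the fasta conversion is sketched below. The text states only that the ID combines a running number, the length and the frequency; the exact layout used here (`>number_length_frequency`) and the ordering by decreasing frequency are assumptions made for the example, and `to_fasta` is a hypothetical name.

```python
def to_fasta(collapsed):
    """Write collapsed tags as fasta text with informative, unique IDs.

    collapsed: dict mapping sequence -> summed frequency.
    The header layout '>number_length_frequency' is an assumed convention.
    """
    # Most frequent first; ties broken alphabetically so output is stable.
    ordered = sorted(collapsed.items(), key=lambda item: (-item[1], item[0]))
    return "".join(
        f">{number}_{len(seq)}_{freq}\n{seq}\n"
        for number, (seq, freq) in enumerate(ordered, start=1)
    )


if __name__ == "__main__":
    print(to_fasta({"ACGT": 5, "TTGAC": 9}), end="")
```

Because the running number alone is unique, the length and frequency fields can be parsed back from the ID later without consulting the Tag file.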
iv). Finding known microRNAs using the trimmed file:
One of our objectives is to study the expression pattern of the known miRNAs. To do this, the sequences in the trimmed files are matched against the known miRNA sequences. The link Trimmed_fasta_generation_profiling.zip contains programs to convert an already trimmed tag file into fasta format and to find known miRNAs using BLAST.
The updated version is TRIMMED_FASTA_generation_profiling_new.zip
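To show the idea of expression profiling, the sketch below replaces the BLAST search used by the distributed programs with plain exact string matching, which is a deliberate simplification. The function name `profile_known` and the miRNA name in the demo are hypothetical.

```python
def profile_known(trimmed, known_mirnas):
    """Sum read frequencies for every known miRNA matched exactly.

    trimmed: dict mapping trimmed read sequence (DNA alphabet) -> frequency.
    known_mirnas: dict mapping miRNA name -> mature sequence (RNA or DNA).
    Exact matching stands in for BLAST here; it misses length variants.
    """
    names_by_sequence = {}
    for name, seq in known_mirnas.items():
        dna = seq.upper().replace("U", "T")  # reads are reported as DNA
        names_by_sequence.setdefault(dna, []).append(name)

    profile = {}
    for seq, freq in trimmed.items():
        for name in names_by_sequence.get(seq, []):
            profile[name] = profile.get(name, 0) + freq
    return profile


if __name__ == "__main__":
    known = {"example-miR": "UGAGGUAGUAGGUUGUAUAGUU"}  # hypothetical entry
    reads = {"TGAGGTAGTAGGTTGTATAGTT": 120, "ACGTACGT": 4}
    print(profile_known(reads, known))
```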
(B) Small RNA annotation: The sRNA sequences obtained are annotated against the known databases using the following protocol:
i). Known / annotated sequence databases: The databases used for the annotation / elimination pipeline are:
(a) Mature miRNAs: from the miRNA registry, release 12 (a total of 866 miRNAs: 692 major mature sequences + 174 star (minor) sequences).
(b) ncRNAs: from Ensembl "Homo_sapiens.NCBI36.49.ncrna.fa" (includes the precursor miRNAs and other ncRNAs such as sn/sno/sca RNAs, tRNAs and rRNAs).
(c) RNA database from the NCBI FTP site (includes rRNAs and mRNAs).
The following databases are obtained with the Perl scripts present in the CREATE_DATABASES.zip folder:
(d) Exons: obtained from the Contig files.
(e) Intergenic / intronic sequences: obtained from the Contig files (using the Homo sapiens Contig file, 29 Feb 2008 version). These sequences serve as the source for finding novel miRNAs (intergenic / intronic).
All the databases used are present in the folder ANNOTATION_SOURCE_DATABASES.zip
(ii) The Elimination Pipeline: To match the sequences quickly against the created databases, the scripts in the folder ELIMINATION_PIPELINE.zip are used. The folder also contains a description of the algorithm and instructions for its use. The pool of sequences left unmatched at the end serves as the source of novel miRNAs (see Figure).
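The logic of successive elimination can be sketched as follows. This is not the algorithm shipped in ELIMINATION_PIPELINE.zip; it assumes that a read is "matched" when it occurs exactly inside some database sequence, and the function name `eliminate` and the toy data are hypothetical.

```python
def eliminate(reads, databases):
    """Annotate reads against databases in order and return the leftovers.

    reads: dict mapping read ID -> sequence.
    databases: ordered list of (label, list_of_sequences) pairs.
    A read is assigned to the first database in which it occurs exactly
    (an assumed matching rule) and is removed from further rounds.
    """
    remaining = dict(reads)
    annotated = {}
    for label, sequences in databases:
        hits = [
            read_id
            for read_id, seq in remaining.items()
            if any(seq in target for target in sequences)
        ]
        for read_id in hits:
            annotated[read_id] = label
            del remaining[read_id]
    return annotated, remaining


if __name__ == "__main__":
    reads = {"r1": "AAAC", "r2": "GGGT", "r3": "CCCA"}
    dbs = [("miRNA", ["TTAAACTT"]), ("ncRNA", ["AGGGTA", "AAAC"])]
    annotated, unmatched = eliminate(reads, dbs)
    print(annotated, unmatched)
```

The order of the databases matters in such a scheme: a read that could match several of them is credited to the first one only.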
(C) Novel miRNA Prediction: The strategy is to first remove all known annotated RNAs and then identify the remaining sequences that are derived from intergenic and intronic regions. These are then subjected to two ab initio miRNA prediction algorithms, along with filters that remove false positive predictions (see Figure).
i) Extraction of matches from the intergenic / intronic regions of the human genome:
The unmatched sequences (from the elimination pipeline) are then matched to the intergenic / intronic regions. The exactly matching sequences are extracted along with 70 nucleotides flanking each end; these extended sequences represent potential precursor sequences.
The programs to do so are present in EXTRACTION.zip
The updated version is EXTRACTION_new.zip
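The extraction of a match together with its flanks can be illustrated with the sketch below. It is not the code in EXTRACTION.zip: it searches the forward strand only, clips the flanks at the ends of the region, and uses the hypothetical name `extract_precursors`.

```python
FLANK = 70  # nucleotides added on each side, as stated in the text


def extract_precursors(read, region, flank=FLANK):
    """Return (start, end, sequence) for every exact match of read in region.

    The extracted sequence spans the match plus up to `flank` nucleotides
    on each side; flanks are shortened where the region ends.
    Only the given strand is searched (a simplification).
    """
    candidates = []
    pos = region.find(read)
    while pos != -1:
        start = max(0, pos - flank)
        end = min(len(region), pos + len(read) + flank)
        candidates.append((start, end, region[start:end]))
        pos = region.find(read, pos + 1)  # continue, allowing overlaps
    return candidates


if __name__ == "__main__":
    read = "GTCAGTCAGGTTCCAAGTCGAT"
    region = "A" * 100 + read + "A" * 100
    for start, end, seq in extract_precursors(read, region):
        print(start, end, len(seq))
```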
ii). Folding the extended sequences and checking the location of the sRNA in the folded structure:
The sequences are scanned for the presence of potential precursor miRNAs using CID-miRNA and CSHMM (a probabilistic scoring tool that has been accepted for publication; the link will be provided once the paper is in print).
Use CID-miRNA package present in the folder miRNA_SCANNERS.zip
The folded sequences generated are then checked to see whether the sRNA (the putative mature miRNA obtained by sequencing) occurs in the folded putative precursor, because the window-scanning approach used could report a folded structure that does not involve the sRNA concerned. Only those hairpins that contain the sRNA are kept. Programs for running CID-miRNA and checking the folded structures are present in Folding_checking.zip.
The updated version is FOLDING_CHECKING_new.zip
The next step involves locating the position of the sRNA in the hairpin. Since mature miRNAs are known to arise from the stem and not the loop, only those hairpins in which the sRNA occurs in the stem are classified as correct cases; the remaining ones are treated as prediction errors. The correct cases are further tested with MiPred.
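The stem/loop check can be sketched as below, under two assumptions that are not taken from the distributed programs: the folded structure is available in dot-bracket notation, and it is a single hairpin with one terminal loop. The function name `locate_srna` and its return labels are hypothetical.

```python
def locate_srna(precursor, structure, srna):
    """Classify where the sRNA lies in a single-hairpin precursor.

    precursor: folded sequence; structure: its dot-bracket string
    (assumed format, same length as precursor); srna: sequenced read.
    Returns 'absent', 'unfolded', '5p_stem', '3p_stem' or 'loop'.
    """
    start = precursor.find(srna)
    if start == -1:
        return "absent"        # hairpin does not involve the sRNA
    if "(" not in structure or ")" not in structure:
        return "unfolded"      # no base pairs, so no stem to sit in
    end = start + len(srna)
    loop_start = structure.rfind("(") + 1  # first base after the 5' arm
    loop_end = structure.find(")")         # first base of the 3' arm
    if end <= loop_start:
        return "5p_stem"
    if start >= loop_end:
        return "3p_stem"
    return "loop"              # sRNA overlaps the terminal loop


if __name__ == "__main__":
    print(locate_srna("GGGGAAAACCCC", "((((....))))", "GGGG"))
```

In this sketch any overlap with the terminal loop counts as a prediction error; a more tolerant rule, for example allowing an overlap of a nucleotide or two, would be an equally reasonable design choice.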
iii). Finding IsomiRs and Star sequences: A list of all the correctly predicted precursor sequences is created, and the sRNAs derived from the same precursor are grouped into a family. The most abundant member is designated as the mature miRNA. The sRNAs that differ from this representative by a few nucleotides are called IsomiRs, and those that have a different, partially complementary sequence and are located on the other strand (the opposite arm of the hairpin stem) are called stars. Example of one novel prediction (for the entire output see (vii) DEMO_final_novel_finder_new). The programs for grouping into IsomiR families and picking the representative are present in Final_novel_prediction.zip.
The updated version is FINAL_NOVEL_FINDING_new.zip
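The grouping into families can be sketched as follows. This is a simplified illustration, not the code in the zip files above: it assumes each sRNA already carries the ID of its precursor and the arm of the hairpin it lies on, and it separates IsomiRs from stars by arm alone rather than by sequence comparison. The name `build_families` and the record layout are hypothetical.

```python
from collections import defaultdict


def build_families(hits):
    """Group sRNAs by precursor and pick the most abundant as representative.

    hits: iterable of (precursor_id, srna_sequence, frequency, arm) tuples,
    where arm is e.g. '5p' or '3p'.
    Members on the representative's arm are reported as IsomiRs, members on
    the opposite arm as stars (an assumed, arm-based simplification).
    """
    grouped = defaultdict(list)
    for precursor_id, seq, freq, arm in hits:
        grouped[precursor_id].append((seq, freq, arm))

    families = {}
    for precursor_id, members in grouped.items():
        rep_seq, rep_freq, rep_arm = max(members, key=lambda m: m[1])
        others = [m for m in members if m[0] != rep_seq]
        families[precursor_id] = {
            "mature": rep_seq,
            "mature_frequency": rep_freq,
            "family_frequency": sum(m[1] for m in members),
            "isomirs": sorted(m[0] for m in others if m[2] == rep_arm),
            "stars": sorted(m[0] for m in others if m[2] != rep_arm),
        }
    return families


if __name__ == "__main__":
    demo = [("p1", "AAAAC", 50, "5p"), ("p1", "AAAACG", 10, "5p"),
            ("p1", "GTTTT", 5, "3p")]
    print(build_families(demo))
```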
Additional files 4 and 5 list the novel miRNAs grouped into families on the basis of sRNAs falling within the same precursor, together with the representative (most abundant) putative miRNA and the individual frequencies of the representative and of the family. The scores assigned to the corresponding precursors by the 4 tools (CID-miRNA, CSHMM, miRDeep and MiPred) are also listed.
Candida Vaz, 16.10.2009