1. Fang et al. GRNdb: decoding the gene regulatory networks in diverse human and mouse conditions, Nucleic Acids Research, 2020 (DOI: 10.1093/nar/gkaa995)
2. Li et al. Single-cell transcriptomic analysis reveals dynamic alternative splicing and gene regulatory networks among pancreatic islets, Science China Life Sciences, 2020 (DOI: 10.1007/s11427-020-1711-x)
At present, GRNdb contains 72 different single-cell conditions (332,920 cells) and 71 bulk conditions (involving 27,748 samples for 33 cancers of TCGA and 27 normal tissues of GTEx) for humans, as well as 41 single-cell conditions (300,150 cells) of different mouse tissues. To ensure the accuracy of gene regulatory network inference, we removed the TCGA (The Cancer Genome Atlas) and GTEx (Genotype-Tissue Expression) datasets that have fewer than 30 samples. More detailed information can be found on the Browse and Statistics webpages. We will collect and add more datasets in the next update of GRNdb.
We used the SCENIC pipeline (Aibar et al., Nature Methods, 2017) to infer gene regulatory networks (GRNs) from RNA-seq data, together with the known TF-target relationships and corresponding motifs from the RcisTarget databases. First, SCENIC uses GENIE3 to detect the gene sets co-expressed with transcription factors (TFs). Then, RcisTarget infers the putative direct-binding targets of each TF based on the motif-TF annotation databases. Finally, the GRNs are identified by following the online SCENIC pipeline step by step. Only the best TF binding motif predicted by SCENIC for each TF-target pair is shown in GRNdb.
The motifs for TF-target pairs in a given condition were identified with the SCENIC pipeline (Aibar et al., 2017, Nature Methods), and only the best motif for each TF-target pair is used in GRNdb. SCENIC employs RcisTarget to identify the transcription factor binding motifs that are over-represented in a given gene set. In this step, SCENIC uses a database that contains the scores (rankings) of each motif around the transcription start site (TSS) of the genes in the organism. The motif score for each gene is calculated over the search space around the TSS. For this analysis, SCENIC uses two databases: i) a database scoring the motifs in the 500 bp region upstream of the TSS, and ii) a database scoring the motifs in the 10 kb region around the TSS. By default, motifs with a Normalized Enrichment Score (NES) > 3.0 are considered significantly enriched in the corresponding TF module.
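The selection of one motif per TF-target pair can be sketched in Python as follows. This is only an illustration and not the code used to build GRNdb: the `MotifHit` record, the gene and motif names, and the interpretation of "best" as "highest NES" are all assumptions made for the example.

```python
from typing import NamedTuple


class MotifHit(NamedTuple):
    """Hypothetical record: one enriched motif supporting one TF-target pair."""
    tf: str
    target: str
    motif: str
    nes: float


def best_motif_per_pair(hits, nes_threshold=3.0):
    """Keep, for each (TF, target) pair, the significant motif with the highest NES.

    Assumption: "best" is taken to mean "highest NES"; the text does not
    state the exact ranking criterion.
    """
    best = {}
    for hit in hits:
        if hit.nes <= nes_threshold:
            continue  # motif is not significantly enriched
        key = (hit.tf, hit.target)
        if key not in best or hit.nes > best[key].nes:
            best[key] = hit
    return best


# Made-up example data.
hits = [
    MotifHit("TF_A", "GENE_1", "motif_x", 4.2),
    MotifHit("TF_A", "GENE_1", "motif_y", 5.1),
    MotifHit("TF_A", "GENE_2", "motif_x", 2.5),
    MotifHit("TF_B", "GENE_1", "motif_z", 3.4),
]
best = best_motif_per_pair(hits)
for (tf, target), hit in sorted(best.items()):
    print(tf, target, hit.motif, hit.nes)
```

In this example the pair TF_A/GENE_2 is dropped because its only motif falls below the threshold, and TF_A/GENE_1 keeps the higher-scoring of its two motifs.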
In gene regulatory network analysis, the SCENIC pipeline normalizes the AUC values into a Normalized Enrichment Score (NES). A high NES denotes a motif that recovers a large proportion of the input genes near the top of its ranking. To identify significantly enriched motifs, the SCENIC pipeline uses a threshold of 3.0, which corresponds to a False Discovery Rate (FDR) between 3% and 9%. The significant motifs are then linked back to transcription factors (TFs) using the annotation databases of RcisTarget in the SCENIC pipeline.
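As a rough illustration of this normalization, the sketch below computes the NES of each motif as a z-score of its AUC against the AUC distribution of all motifs for the same gene set, which is how the RcisTarget documentation describes it. The function names and AUC values are hypothetical, and the use of the sample standard deviation is an assumption; implementations may differ in the estimator they use.

```python
from statistics import mean, stdev


def normalized_enrichment_scores(auc_by_motif):
    """Convert per-motif AUC values into NES values: (AUC - mean) / sd.

    Assumption: the sample standard deviation is used here.
    """
    values = list(auc_by_motif.values())
    mu = mean(values)
    sigma = stdev(values)
    return {motif: (auc - mu) / sigma for motif, auc in auc_by_motif.items()}


def significant_motifs(auc_by_motif, threshold=3.0):
    """Return the motifs whose NES exceeds the threshold (3.0 by default)."""
    nes = normalized_enrichment_scores(auc_by_motif)
    return {motif: score for motif, score in nes.items() if score > threshold}


# Made-up AUC values: 19 background motifs and one clearly enriched motif.
auc_by_motif = {f"motif_{i}": 0.040 + 0.001 * i for i in range(19)}
auc_by_motif["motif_enriched"] = 0.20

nes = normalized_enrichment_scores(auc_by_motif)
significant = significant_motifs(auc_by_motif)
print(sorted(significant))
```

Only the outlying motif passes the threshold here; the background motifs all have NES values close to or below zero.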
In gene regulatory network inference, the SCENIC pipeline uses two different types of motif annotations provided by the cisTarget databases. The first type is either annotated directly in the original cisTarget database or inferred by orthology, and is denoted as high-confidence. The second type is inferred by motif similarity and is denoted as low-confidence. A TF motif is therefore labeled high-confidence if its annotation is of the first type, and low-confidence otherwise.
The main regulons inferred by the SCENIC pipeline use only the “high confidence” annotations of the cisTarget database, which by default are “direct annotation” and “inferred by orthology”. The suffix '_extended' in a regulon name indicates that lower-confidence annotations inferred by motif similarity were also used.
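For users who process regulon names programmatically, the convention above could be handled along the following lines. This is a minimal sketch: the helper names and the example regulon names are hypothetical, and the code assumes the name ends exactly with '_extended' with nothing after it.

```python
HIGH_CONFIDENCE_ANNOTATIONS = {"direct annotation", "inferred by orthology"}
EXTENDED_SUFFIX = "_extended"


def annotation_confidence(annotation_type):
    """Map an annotation type to the confidence label described in the text."""
    if annotation_type in HIGH_CONFIDENCE_ANNOTATIONS:
        return "high-confidence"
    return "low-confidence"


def parse_regulon_name(name):
    """Split a regulon name into (TF name, uses_low_confidence_annotations)."""
    if name.endswith(EXTENDED_SUFFIX):
        return name[: -len(EXTENDED_SUFFIX)], True
    return name, False


# Hypothetical regulon names.
print(parse_regulon_name("TF_A"))           # main regulon
print(parse_regulon_name("TF_A_extended"))  # extended regulon
print(annotation_confidence("inferred by motif similarity"))
```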
We collected the single-cell RNA-seq datasets from the public databases Gene Expression Omnibus (GEO) and ArrayExpress. The bulk datasets of diverse cancers and normal tissues were downloaded from UCSC Xena (for TCGA) and GTEx, respectively. The accession ID of the original data for each condition can be found on the Statistics webpage.
All the TF-target pairs for various conditions of human and mouse can be freely obtained on the Download webpage.
For single-cell datasets whose original publications did not provide detailed cell-type/cluster information, we used Seurat (Stuart et al. 2019, Cell) with the standard pipeline to define the clusters. If the cell type information was available from the original paper, we used that annotation directly.
The markers for each cluster of different single-cell conditions were identified with Seurat (Stuart et al. 2019, Cell) using the standard pipeline. Users can freely download the markers for all clusters of a specific condition provided in GRNdb for further analysis.
In GRNdb, human genes are linked to GeneCards, while mouse genes are linked to the MGI database. When a user clicks a gene name, the gene is automatically looked up in the relevant database, which shows its detailed information and function.
There is no limit on the number of input genes on the Expression webpage. For performance reasons, however, we recommend entering no more than 30 genes.
We used the Python package lifelines to perform the survival analysis. The median expression of the input gene was used to stratify the patients of a specific cancer into two groups. We did not set any limit on the number of input genes on the Survival webpage, but for performance reasons it is better to keep the number of genes below 30.
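The median-based stratification can be sketched as below. This is not the code running on the Survival webpage: the function names and patient data are hypothetical, assigning patients exactly at the median to the "low" group is an assumption, and comparing the two groups with Kaplan-Meier curves and a log-rank test is likewise an assumption, since the text only states that lifelines was used.

```python
from statistics import median


def stratify_by_median(expression_by_patient):
    """Split patients into "high" and "low" groups at the median expression.

    Assumption: patients exactly at the median are assigned to "low".
    """
    cutoff = median(expression_by_patient.values())
    return {
        patient: ("high" if value > cutoff else "low")
        for patient, value in expression_by_patient.items()
    }


def compare_survival(groups, durations, events):
    """Fit one Kaplan-Meier curve per group and run a log-rank test.

    lifelines is imported here so that the stratification helper above
    can be used without it being installed.
    """
    from lifelines import KaplanMeierFitter
    from lifelines.statistics import logrank_test

    members = {
        label: [p for p, g in groups.items() if g == label]
        for label in ("high", "low")
    }
    curves = {}
    for label, patients in members.items():
        kmf = KaplanMeierFitter()
        kmf.fit(
            [durations[p] for p in patients],
            event_observed=[events[p] for p in patients],
            label=label,
        )
        curves[label] = kmf
    result = logrank_test(
        [durations[p] for p in members["high"]],
        [durations[p] for p in members["low"]],
        event_observed_A=[events[p] for p in members["high"]],
        event_observed_B=[events[p] for p in members["low"]],
    )
    return curves, result.p_value


# Made-up expression values for four patients.
expression = {"P1": 1.0, "P2": 5.0, "P3": 3.0, "P4": 8.0}
groups = stratify_by_median(expression)
print(groups)
```

Given per-patient follow-up times and event indicators (1 for an observed event, 0 for censored), `compare_survival(groups, durations, events)` would return the fitted curves and the log-rank p-value for the two groups.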
Yes, we will continue to collect more datasets and add the inferred gene regulatory networks into GRNdb in the future.
Go to the Browse webpage first, and then find the tissue/condition you are interested in for a specific organism. Once you click the selected tissue/condition, all the inferred TF-target pairs will be shown.
To download a figure in GRNdb, click the download icon of that figure, and the figure will be downloaded automatically.
We are planning to continuously update GRNdb in the future. Once we finish the processing and analysis of newly collected datasets, we will add the related gene regulatory network information into GRNdb. The updated information will be shown on the Home webpage.