1. Fang et al. GRNdb: decoding the gene regulatory networks in diverse human and mouse conditions, Nucleic Acids Research, 2020 (DOI: 10.1093/nar/gkaa995)
2. Li et al. Single-cell transcriptomic analysis reveals dynamic alternative splicing and gene regulatory networks among pancreatic islets, Science China Life Sciences, 2020 (DOI: 10.1007/s11427-020-1711-x)
At present, GRNdb contains 72 different single-cell conditions (332,920 cells) and 71 bulk
conditions (involving 27,748 samples for 33 cancers of TCGA and 27 normal tissues of GTEx) for
humans, as well as 41 single-cell conditions (300,150 cells) of different mouse tissues. To
ensure the accuracy of gene regulatory network inference, we removed the TCGA (The Cancer Genome
Atlas) and GTEx (Genotype-Tissue Expression) datasets that have fewer than 30 samples. More
detailed information can be found on the Browse and Statistics webpages. We will collect and add
more datasets in the next update of GRNdb.
We employed the SCENIC pipeline (Aibar et al., Nature Methods, 2017) to infer the gene regulatory
networks (GRNs) from RNA-seq data, together with the known TF-target relationships and
corresponding motifs in the RcisTarget database. First, SCENIC uses GENIE3 to detect the gene
sets coexpressed with transcription factors (TFs). Then, RcisTarget infers the putative
direct-binding targets of each TF based on the motif-TF annotation databases. Finally, the GRNs
were identified by following the online SCENIC pipeline step by step.
Only the best TF binding motif predicted by SCENIC (Aibar et al., 2017, Nature Methods) for each
TF-target pair in a given condition is shown in GRNdb. SCENIC employs RcisTarget to identify the
transcription factor binding motifs (TFBS) that are over-represented in a given gene set. In this
step, SCENIC uses a database that includes the scores (rankings) of each motif around the
transcription start site (TSS) of the genes in the organism. The motif score for each gene is
calculated based on the search space around the TSS. For this analysis, SCENIC uses two
databases: i) one scoring the motifs in the 500 bp region upstream of the TSS, and ii) one
scoring the motifs in the 10 kb region around the TSS. By default, motifs with a Normalized
Enrichment Score (NES) > 3.0 are defined as significantly enriched in the corresponding TF module.
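The selection described above can be sketched as follows. This is a minimal illustration, not the GRNdb or SCENIC implementation: it assumes that the "best" motif for a TF-target pair is the one with the highest NES (the text does not define "best"), and the `MotifHit` record layout, TF, gene, and motif names are all hypothetical.

```python
from typing import NamedTuple

NES_THRESHOLD = 3.0  # SCENIC default stated in the text


class MotifHit(NamedTuple):
    # Hypothetical record layout, for illustration only.
    tf: str
    target: str
    motif: str
    nes: float


def best_motif_per_pair(hits, threshold=NES_THRESHOLD):
    """Keep, for each (TF, target) pair, the enriched motif with the highest NES.

    Assumption: "best" means highest NES among motifs passing the threshold.
    """
    best = {}
    for hit in hits:
        if hit.nes <= threshold:  # not significantly enriched, skip
            continue
        key = (hit.tf, hit.target)
        if key not in best or hit.nes > best[key].nes:
            best[key] = hit
    return best


# Made-up example records.
hits = [
    MotifHit("TF_A", "gene1", "motif_x", 4.2),
    MotifHit("TF_A", "gene1", "motif_y", 5.1),
    MotifHit("TF_A", "gene2", "motif_x", 2.7),  # below threshold, dropped
    MotifHit("TF_B", "gene1", "motif_z", 3.4),
]
best = best_motif_per_pair(hits)
for (tf, target), hit in sorted(best.items()):
    print(tf, target, hit.motif, hit.nes)
```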
In gene regulatory network analysis, the SCENIC pipeline normalizes the AUC (area under the
recovery curve) of each motif into a Normalized Enrichment Score (NES). A high NES denotes a
motif that recovers a large proportion of the input genes within the top of its ranking. To
identify significantly enriched motifs, the SCENIC pipeline uses a threshold of 3.0, which
corresponds to a False Discovery Rate (FDR) between 3% and 9%. The significant motifs are then
linked back to transcription factors (TFs) using the annotation databases of RcisTarget.
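To make the normalization concrete, the sketch below standardizes each motif's AUC against the AUCs of all motifs for one gene set, i.e. NES = (AUC - mean) / sd, which is how we understand the RcisTarget documentation to describe it. The AUC values and motif names are invented, and the use of the sample standard deviation is an assumption of this sketch.

```python
from statistics import mean, stdev

NES_THRESHOLD = 3.0  # default threshold stated in the text


def normalized_enrichment_scores(auc_by_motif):
    """Z-score each motif's AUC against the AUCs of all motifs for one gene set."""
    values = list(auc_by_motif.values())
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1 denominator)
    return {motif: (auc - mu) / sigma for motif, auc in auc_by_motif.items()}


# Hypothetical AUC values: 30 background motifs plus one strongly enriched motif.
auc_by_motif = {f"motif_{i}": 0.020 + 0.001 * (i % 5) for i in range(30)}
auc_by_motif["motif_top"] = 0.080

nes = normalized_enrichment_scores(auc_by_motif)
enriched = [motif for motif, score in nes.items() if score > NES_THRESHOLD]
print(enriched)
```

Because the NES is relative to the other motifs scored for the same gene set, the same AUC can yield a different NES in a different gene set.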
In gene regulatory network inference, the SCENIC pipeline uses two types of motif annotations
provided by the cisTarget databases. The first type is annotated directly in the original
cisTarget database or inferred by orthology, and is denoted as high-confidence. The second type
is inferred by motif similarity, and is denoted as low-confidence.
SCENIC employs the GENIE3 model to infer gene regulatory networks. The GENIE3 weights denote the
weights of the links directed from TFs to target genes, and higher weights correspond to more
likely regulatory links. However, the weights have no statistical meaning; they only provide a
way to rank the regulatory links. Caution is therefore needed when choosing a cutoff, since there
is no standard threshold value. More details can be found on the GENIE3 website.
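Since the weights are only meaningful as a ranking, one cautious way to use them is to sort the links and keep a chosen number of top-ranked ones rather than apply an absolute cutoff. The sketch below illustrates this with made-up links and weights; the function name and tuple layout are hypothetical, and it does not call GENIE3 itself.

```python
def top_links(links, k):
    """Return the k highest-weight (tf, target, weight) links, best first."""
    return sorted(links, key=lambda link: link[2], reverse=True)[:k]


# Hypothetical TF-to-target links with GENIE3-style weights.
links = [
    ("TF_A", "gene1", 0.012),
    ("TF_A", "gene2", 0.087),
    ("TF_B", "gene1", 0.045),
    ("TF_B", "gene3", 0.003),
]

for tf, target, weight in top_links(links, k=2):
    print(tf, target, weight)
```

The value of `k` is as arbitrary as a weight cutoff would be, so it should be chosen and reported with the same caution.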
The main regulons inferred by the SCENIC pipeline use only the “high confidence” annotations of
the cisTarget database, which by default are “direct annotation” and “inferred by orthology”.
The suffix '_extended' in a regulon name denotes that lower-confidence annotations inferred by
motif similarity are also used.
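When processing downloaded regulons programmatically, the naming convention above can be checked with a small helper. This is only a sketch: it assumes the regulon name is a TF symbol optionally followed by the '_extended' suffix, and the function name and example names are hypothetical.

```python
EXTENDED_SUFFIX = "_extended"


def parse_regulon_name(name):
    """Split a regulon name into (tf, is_extended).

    Assumption: the name is a TF symbol, optionally followed by '_extended'.
    """
    if name.endswith(EXTENDED_SUFFIX):
        return name[: -len(EXTENDED_SUFFIX)], True
    return name, False


for regulon in ["TF_A", "TF_A_extended"]:
    tf, is_extended = parse_regulon_name(regulon)
    label = "includes motif-similarity annotations" if is_extended else "high-confidence only"
    print(tf, "->", label)
```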
We collected the single-cell RNA-seq datasets from the public databases Gene Expression Omnibus
(GEO) and ArrayExpress. The bulk datasets of diverse cancers and normal tissues were
downloaded from UCSC Xena (for TCGA) and GTEx, respectively. The accession ID of the original
data for each condition can be found on the Statistics webpage.
All the TF-target pairs for various conditions of human and mouse can be freely obtained on the
Download webpage.
We employed Seurat (Stuart et al. 2019, Cell) with the standard pipeline to define the clusters
for single-cell datasets that lack detailed cell-type/cluster information in the original
publications. If the cell-type information was available from the original paper, we used that
annotation directly.
The markers for each cluster of different single-cell conditions were identified with Seurat
(Stuart et al. 2019, Cell) using the standard pipeline. Users can freely download the markers for
all clusters of a specific condition provided in GRNdb for further analysis.
In GRNdb, the human genes are linked to GeneCards, while the mouse genes are linked to the MGI
database. When a user clicks a gene name, the gene is automatically looked up in the relevant
database, which shows its detailed information and function.
There is no limit on the number of input genes on the Expression webpage. For performance
reasons, we recommend entering no more than 30 genes.
We used the Python package lifelines to perform the survival analysis. The median expression of
the input gene was used to stratify the patients of a specific cancer into two groups. We did not
set any limit on the number of input genes on the Survival webpage. For performance reasons, it
is better to keep the number of genes below 30.
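The median-based stratification can be sketched as below. GRNdb uses lifelines for the actual analysis; to keep this example dependency-free, it swaps in a small hand-written Kaplan-Meier estimator as a stand-in for what a fitter such as lifelines' `KaplanMeierFitter` would compute. The patient IDs, expression values, follow-up times, and the rule that patients exactly at the median go to the low group are all assumptions made for illustration.

```python
from statistics import median


def stratify_by_median(expression):
    """Split patient IDs into (high, low) groups at the median expression.

    Assumption: patients exactly at the median are assigned to the low group.
    """
    cutoff = median(expression.values())
    high = [p for p, value in expression.items() if value > cutoff]
    low = [p for p, value in expression.items() if value <= cutoff]
    return high, low


def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time with an observed event."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)  # events plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical data: expression of one gene and (follow-up months, event observed).
expression = {"P1": 2.0, "P2": 8.5, "P3": 5.1, "P4": 1.2, "P5": 9.3, "P6": 4.4}
follow_up = {
    "P1": (30, True), "P2": (12, True), "P3": (20, False),
    "P4": (45, False), "P5": (8, True), "P6": (38, True),
}

high, low = stratify_by_median(expression)
curves = {}
for label, group in (("high", high), ("low", low)):
    durations = [follow_up[p][0] for p in group]
    events = [follow_up[p][1] for p in group]
    curves[label] = kaplan_meier(durations, events)
    print(label, group, curves[label])
```

In a real analysis the two groups would typically also be compared with a log-rank test, which is omitted here for brevity.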
Yes, we will continue to collect more datasets and add the inferred gene regulatory networks into
GRNdb in the future.
You can go to the Browse webpage first, and then find the tissue/condition of interest for a
specific organism. Once you click the selected tissue/condition, all the inferred TF-target
pairs will be shown.
You just need to click the download icon of the figure of interest in GRNdb, and the figure will
be downloaded automatically.
We are planning to continuously update GRNdb in the future. Once we finish the processing and
analysis of newly collected datasets, we will add the related gene regulatory network
information into GRNdb. The updated information will be shown on the Home webpage.