To address the difficulty of extracting useful information from the diverse data types associated with long non-coding RNA (lncRNA), this paper proposes a visualization system for multi-source lncRNA data based on the Generic Genome Browser (GBrowse). The system consists mainly of a web server and lncRNA data storage. The web server comprises an HTTP service and the GBrowse web components, and the data storage supports flat files, MySQL, SQLite, and other backends. Constructing the system involves GBrowse installation and configuration, collection of multi-source lncRNA data, data preprocessing, data storage, and configuration of data access and visualization. A prototype system was built by collecting six sets of human lncRNA data: human gene annotation, the genome sequence, histone modification H3K4me3 signals and their predicted loci, and transcription factor CTCF binding-site signals and their predicted loci. After preprocessing, the data were stored in databases such as MySQL and SQLite, and the data access methods and visualization parameters were configured. The experimental results show that multi-source lncRNA data can be integrated and visualized within the GBrowse framework and displayed simultaneously in genomic coordinate space, which allows researchers to inspect the data more intuitively and helps them formulate new scientific hypotheses.
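As a rough illustration of the preprocessing step mentioned above, the sketch below converts peak loci from BED format into GFF3 feature lines, a format GBrowse can load. The abstract does not state which file formats the prototype used, so the BED input, the `bed_to_gff3` and `convert_lines` helpers, and the `source` and `feature_type` defaults are all assumptions made for illustration, not the authors' pipeline.

```python
def bed_to_gff3(bed_line, source="demo", feature_type="binding_site"):
    """Convert one tab-separated BED record into one GFF3 feature line.

    Hypothetical helper: `source` and `feature_type` are placeholder values.
    BED starts are 0-based and half-open; GFF3 starts are 1-based and
    inclusive, so only the start coordinate is shifted by one.
    """
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED record needs at least chrom, start, end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Optional BED columns: name, score, strand.
    name = fields[3] if len(fields) > 3 else f"{chrom}_{start + 1}_{end}"
    score = fields[4] if len(fields) > 4 else "."
    strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "."
    columns = [chrom, source, feature_type, str(start + 1), str(end),
               score, strand, ".", f"Name={name}"]
    return "\t".join(columns)


def convert_lines(bed_lines, **kwargs):
    """Yield a GFF3 header, then one feature line per BED data line."""
    yield "##gff-version 3"
    for line in bed_lines:
        stripped = line.strip()
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        yield bed_to_gff3(stripped, **kwargs)
```

Passing the lines of a BED file to `convert_lines` and writing the result to a `.gff3` file gives a flat file that could then be loaded into whichever storage backend is configured. The coordinate shift is the detail most easily missed when mixing the two formats.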