UNMC team develops NIH-backed genomic data analysis tools on the Google Cloud Platform 

Babu Guda, PhD, and Jordan Rowley, PhD

The explosion of research data generated over the last 10 years has created computing challenges at every stage, from production to consumption.

“Raw data in itself does not have much value unless it’s cleaned, normalized, analyzed, and correlated for knowledge extraction,” said Babu Guda, PhD, director of the Bioinformatics and Systems Biology Core and professor and vice-chair in the UNMC Department of Genetics, Cell Biology and Anatomy (GCBA).

To address this challenge, Dr. Guda and Jordan Rowley, PhD, co-director of the Bioinformatics and Systems Biology graduate program and assistant professor in the UNMC Department of GCBA, jointly created a cloud-based learning module, Cloud-ATAC. The National Institutes of Health has selected it as the premier model for cloud-based learning in ATAC-seq and single-cell ATAC-seq data analysis.

Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) identifies regions of accessible chromatin and transcription factor footprints genome-wide, Dr. Rowley said.

When combined with paired-end sequencing, ATAC-seq data can also be analyzed to produce genome-wide maps of nucleosome-occupied chromatin and Transposase Hypersensitive Sites (THSSs), which correlate with transcription factor occupancy.

“Due to the wealth of information it provides, ATAC-seq has quickly become a popular tool,” he said. 

“Your team did an outstanding job in working with the Google Professional Services team for developing a very high-quality cloud-based learning module which includes many advanced features for data analysis and data visualization such as sequence quality plots, genome viewer, and sequence motifs,” Lakshmi Kumar Matukumalli, program director in the NIH Networks and Development Programs Division for Research Capacity Building, wrote in a letter to Drs. Rowley and Guda. 

“It is an honor to be selected and recognized for our leadership in carrying out these types of projects that have such national importance,” Dr. Guda said. 

The software platform Drs. Guda and Rowley created will help harness and translate the immense volume of data generated in research and provide insight and answers to complex questions. 

“We need to have an entire ecosystem built to handle storage capacity, high performance computers, the bandwidth for data transfer and skilled personnel who can help researchers mine their data,” Dr. Guda said. 

To democratize access to this ecosystem, the NIH came up with the idea of a central, cloud-based location for storing publicly available data, Dr. Guda said. Such a resource would let smaller institutions access the data the same way larger institutions do and connect them with experts who can help them use these resources.

That is when Drs. Guda and Rowley applied for a grant through the parent INBRE award to develop a cloud module for ATAC-seq data analysis.

The Cloud-ATAC module was developed using Jupyter notebooks on the Google Cloud Platform with the help of Avinash Veerappa, PhD, who is a member of Dr. Guda’s laboratory.  

The open-source software is now available in the NIH/NIGMS Sandbox GitHub repository.

Along with Cloud-ATAC, users can access eleven other cloud-based modules developed by NIH-backed projects to analyze a variety of research data.  

Through self-paced interactive tutorials and example analyses on the cloud, the NIH Sandbox offers the means and training to perform advanced data analysis, regardless of an individual researcher’s background or local computational resources.

“These modules break down current barriers and offer an exciting opportunity for everyone to leverage the power of genomics in their research,” Dr. Rowley said.