PLoS ONE (Jan 2021)

Genome-wide binding analysis of 195 DNA binding proteins reveals "reservoir" promoters and human specific SVA-repeat family regulation.

  • Michael J Smallegan,
  • Soraya Shehata,
  • Savannah F Spradlin,
  • Alison Swearingen,
  • Graycen Wheeler,
  • Arpan Das,
  • Giulia Corbet,
  • Benjamin Nebenfuehr,
  • Daniel Ahrens,
  • Devin Tauber,
  • Shelby Lennon,
  • Kevin Choi,
  • Thao Huynh,
  • Tom Wieser,
  • Kristen Schneider,
  • Michael Bradshaw,
  • Joel Basken,
  • Maria Lai,
  • Timothy Read,
  • Matt Hynes-Grace,
  • Dan Timmons,
  • Jon Demasi,
  • John L Rinn

DOI
https://doi.org/10.1371/journal.pone.0237055
Journal volume & issue
Vol. 16, no. 6
p. e0237055

Abstract

A key aspect in defining cell state is the complex choreography of DNA binding events in a given cell type, which in turn establishes a cell-specific gene-expression program. Here we set out to analyze in depth the DNA binding events and transcriptional output of a single cell state (K562 cells). To this end we re-analyzed ENCODE data for 195 DNA binding proteins. We used standardized analysis pipelines, containerization, and literate programming with R Markdown for reproducibility and rigor. Our approach validated many findings from previous independent studies, underscoring the importance of ENCODE's goals in providing these reproducible data resources. We also had several new findings, including: (i) 1,362 promoters, which we refer to as 'reservoirs,' that are defined by having up to 111 different DNA binding proteins localized on one promoter, yet do not have any expression of steady-state RNA. (ii) Reservoirs do not overlap super-enhancer annotations and have properties distinct from super-enhancers. (iii) The human-specific SVA repeat element may have been co-opted for enhancer regulation and is highly transcribed in PRO-seq and RNA-seq. Collectively, this study, performed by the students of a CU Boulder computational biology class (BCHM 5631, Spring 2020), demonstrates the value of reproducible findings and how resources like ENCODE that prioritize data standards can foster new findings with existing data in a didactic environment.