Promoters play a central role in gene regulation; however, only a small set of promoters is used for most genetic construct design in the yeast Saccharomyces cerevisiae. Models that accurately predict protein expression from promoter sequence may enable rapid generation of useful novel promoters, facilitating synthetic biology efforts in this model organism. We measured the activity of over 675,000 unique sequences in a constitutive promoter library, and over 327,000 sequences in a library of inducible promoters. Training an ensemble of convolutional neural networks jointly on the two datasets yielded high predictive accuracy (R² > 0.79) on multiple prediction tasks. We developed model-guided design strategies that yielded large, sequence-diverse sets of novel promoters with activities similar to those of current best-in-class sequences. In addition to providing large sets of new promoters, our results show the value of model-guided design as an approach for generating DNA parts.
Overall design
Promoter activity was measured using a “FACS-seq” reporter-based assay. Libraries of yeast cells harboring a plasmid in which mCherry expression was driven by PTEF1 (as a control for expression noise) and GFP expression was driven by a member of a sequence library were grown in synthetic complete media containing 2% dextrose and lacking uracil, and were FACS-sorted into bins on the basis of the ratio of GFP to mCherry expression. Plasmid DNA was extracted from each bin, and bin-specific barcodes were applied by PCR. PCR amplicons were pooled and sequenced to derive read counts for each sequence in each bin; these data were used to extract quantitative estimates of promoter activity for each sequence. This was first done for a library of constitutive promoters based on natural pGPD, then for a library of beta-estradiol-inducible promoters based on the pZEV system (McIsaac et al. 2014, PMID 24445804). Neural network models of promoter activity were then trained on results from these first two libraries and used to generate sets of novel promoters designed to fulfill a variety of objectives. These designs, together with control promoters from the first two libraries, were assayed in the third experiment. In the first two experiments, samples were sequenced on Illumina MiSeq (2x300) and NextSeq (1x75) platforms. In the third, only MiSeq was used. In the GPD experiment, 2 replicates of 12 bins each were collected; in the following experiments, 12 bins each were collected in the presence and in the absence of 1 µM beta-estradiol inducer. In some experiments, aliquots of the original library were also prepared for sequencing.
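As an illustration of how per-bin read counts can be turned into an activity estimate, the following is a minimal sketch and not the estimator used in this study, which is not specified here. It assumes that each bin's counts are first normalized by that bin's total read depth and that activity is summarized as the read-fraction-weighted mean of a representative value per bin. The function names, the `bin_centers` values, and the `min_reads` threshold are all hypothetical.

```python
from typing import Optional, Sequence


def normalize_bin_counts(counts: Sequence[int],
                         bin_totals: Sequence[int]) -> list:
    """Convert one sequence's raw read counts into per-bin read fractions.

    Dividing by each bin's total read count corrects for bins that were
    sequenced to different depths (an assumed normalization).
    """
    return [c / t if t > 0 else 0.0 for c, t in zip(counts, bin_totals)]


def estimate_activity(counts: Sequence[int],
                      bin_totals: Sequence[int],
                      bin_centers: Sequence[float],
                      min_reads: int = 10) -> Optional[float]:
    """Estimate activity as the weighted mean of bin centers.

    counts      -- reads for this sequence in each sorted bin
    bin_totals  -- total reads obtained from each bin
    bin_centers -- hypothetical representative GFP/mCherry value per bin
    min_reads   -- hypothetical coverage cutoff; below it, return None
    """
    if not (len(counts) == len(bin_totals) == len(bin_centers)):
        raise ValueError("counts, bin_totals and bin_centers must align")
    if sum(counts) < min_reads:
        return None  # too few reads for a stable estimate
    fractions = normalize_bin_counts(counts, bin_totals)
    total = sum(fractions)
    if total == 0:
        return None
    weighted = sum(f * c for f, c in zip(fractions, bin_centers))
    return weighted / total


if __name__ == "__main__":
    # Toy example with 12 bins, matching the number of bins collected.
    centers = [float(i) for i in range(12)]
    totals = [1000] * 12
    reads = [0, 10, 30, 10, 0, 0, 0, 0, 0, 0, 0, 0]
    print(estimate_activity(reads, totals, centers))
```

A fuller treatment would typically also weight each bin by the fraction of sorted cells it received and combine replicates; those steps are omitted here for brevity.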