This repository contains the data used in Tassios et al. The folder real_sequences contains the FASTA files of the artificial intergenic ORFs (iORFs) of the 332 Saccharomycotina genomes published by Shen et al. iORFs were generated from every intergenic region longer than six nucleotides, as in Vakirlis et al. (2018) and Vakirlis, Acar, et al. (2020). For every genome assembly, the coordinates of intergenic regions were obtained using the BEDtools subtract tool together with the corresponding annotation GTF file. We extracted the intergenic sequences, removed stop codons in the +1 reading frame, and then translated the sequences using custom Python scripts (each intergenic sequence produced one iORF). Sequences containing non-ACGT nucleotides were discarded.
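The stop codon removal and translation step can be illustrated with a minimal sketch. This is not the custom script used to build the dataset. The function name make_iorf is hypothetical, and the sketch assumes that stop codons are deleted rather than replaced, that a trailing partial codon is dropped, and that the standard genetic code is used; the description above specifies none of these.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
# Assumption: the source does not state which translation table was used.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def make_iorf(intergenic_seq):
    """Return (nucleotide iORF, translated iORF), or None if discarded.

    Hypothetical helper: reads the sequence in the +1 frame, deletes
    every in-frame stop codon, and translates the remaining codons.
    """
    seq = intergenic_seq.upper()
    # Discard sequences containing anything other than A, C, G, T.
    if set(seq) - set("ACGT"):
        return None
    # Split into complete codons; a trailing partial codon is dropped
    # (an assumption of this sketch).
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    kept = [codon for codon in codons if codon not in STOP_CODONS]
    nucleotide = "".join(kept)
    protein = "".join(CODON_TABLE[codon] for codon in kept)
    return nucleotide, protein
```

For instance, make_iorf("ATGTAAGGGTT") removes the in-frame TAA and drops the trailing TT, while a sequence containing N is discarded.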
The folder randomized_sequences contains the FASTA files of the single-nucleotide shuffled iORF controls. The nucleotide sequences of the iORFs were shuffled using custom Python scripts, following the stop codon removal. Each position of the nucleotide sequence was randomized, and whenever the randomization produced an in-frame stop codon, its three positions were randomized again until they no longer formed a stop codon. This ensured that the shuffled sequences retained exactly the same nucleotide composition as their real counterparts.
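One possible reading of the shuffling procedure is sketched below; it is an illustration, not the script used to produce these files. The function name shuffle_without_stops and the rng parameter are hypothetical. The sketch assumes that "randomized again" means permuting the three nucleotides of the offending codon in place, which keeps the overall nucleotide composition unchanged.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def shuffle_without_stops(nucleotide_seq, rng=None):
    """Shuffle single nucleotides, then repair in-frame stop codons.

    Hypothetical helper. Assumption: a stop codon created by the
    shuffle is fixed by permuting its own three nucleotides, so the
    result is always a permutation of the input sequence.
    """
    rng = rng or random.Random()
    bases = list(nucleotide_seq.upper())
    rng.shuffle(bases)  # randomize every position of the sequence
    usable = len(bases) - len(bases) % 3
    for start in range(0, usable, 3):
        codon = bases[start:start + 3]
        # Every stop codon has non-stop permutations (for example
        # TAA -> ATA, TAG -> ATG), so this loop ends with probability 1.
        while "".join(codon) in STOP_CODONS:
            rng.shuffle(codon)
        bases[start:start + 3] = codon
    return "".join(bases)
```

Passing a seeded generator, for example shuffle_without_stops(seq, random.Random(0)), makes a shuffled control reproducible.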