proteinsequence

Edit: Feel free to edit this page if you're interested in working on a project like this.

I am interested in turning protein sequences into some form of audio. The basic concept is to turn the 20 or so different amino acids in a protein sequence (which can range from about 50 to several hundred amino acids long) into some kind of audio signal, and then analyze the signal.

I mainly work in bioinformatics, so I'm not very knowledgeable on the audio processing end, but I have access to a lot of different protein sequences to work with. I'm open to suggestions and really any ideas related to biology or bioinformatics in general.

Cheers, Tim


Hey Tim, hope it's ok if I edit this, I didn't know if my project ideas would warrant a new link… I was also thinking of some sort of genome to audio translation. On a massive scale, wouldn't it be beautiful to input your DNA, and output a symphony? Obviously that's unachievably huge right now, and we're not about to go sequencing our own genomes, but there are a number of fully sequenced bacteria/plants/animals that we could use to produce something along the lines of… riffs on Haemophilus influenzae, for example.

Similarly, I was also thinking it would be really neat to create a demographic/epidemiologic data –> audio program that would basically take an equation as input. I guess the program would fit a linear or logistic regression on user-input variables such as age [continuous], gender [binary], race [categorical], and other variables of interest. It could be cool to create sound profiles of individuals or risk groups for various diseases. I haven't entirely hammered out this thought.

-Eli(zabeth)


Hey there, a quick suggestion… You probably don't want to start synthesizing audio from scratch, as there are tons of synthesis engines out there. I have used ChucK recently for a sonification project. A more popular audio synthesis environment/engine is SuperCollider. There might be a learning curve to figuring out the programming syntax for these synthesis engines. But if you get past that (or solicit someone who already knows), you can probably create lots of fun mappings between bioinformatic data and audio.

–Andy


Hi guys,

Eli: Both those ideas sound really cool. For the first idea, I was thinking we could convert either a protein or a genomic sequence into a riff, or perhaps into a single sound, using different frequencies for the amino acids. I'm not really sure how well either of them would work, how feasible it is, or whether it'd even sound remotely interesting! But I'm open-minded. As for your second idea, do you have a dataset to work with? I'm not really sure how we'd sonify the data, but it's definitely along the lines of something I'd be interested in doing.

Andy: I appreciate the tip. I downloaded both those programs, so at the very least, I'll be able to take a look at them this weekend.

If there's anyone else interested in a project like this, feel free to edit this post with your comments!

-Tim


Audio representation of a protein sequence

A protein is composed of a sequence of amino acids. Each amino acid may be represented by a single-letter code (SLC), so a protein sequence may be input as a string like this:

mvlspadktnvkaawgkvgahageygaealermflsfpttktyfphfdlshgsaqvkghgkkvadaltnavahvddmpnalsalsdlhahklrvdpvnfkllshcllvtlaahlpaeftpavhasldkflasvstvltskyr

Amino acids are encoded by 3-base DNA codons. There are four bases: A, T, C, and G. Therefore, there are 4*4*4=64 possible 3-base codons. A protein sequence may therefore also be represented by a string of bases.
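As a quick check of the arithmetic, the 64 codons can be enumerated directly. This is only a minimal sketch; the names BASES and CODONS are made up for illustration.

```python
from itertools import product

BASES = "ATCG"  # the four DNA bases

# Every ordered triple of bases is one codon: 4 * 4 * 4 = 64.
CODONS = ["".join(triple) for triple in product(BASES, repeat=3)]

print(len(CODONS))  # number of distinct codons
print(CODONS[:4])   # the first few, in the order given by BASES
```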

There are 20 naturally occurring amino acids but 64 codons: 61 of them encode an amino acid and the remaining three are “stop” codons, which means that there is redundancy. For example, there are three codons for isoleucine: ATT, ATC, and ATA. The same string of amino acids (and therefore the same protein) can therefore be represented by many different strings of bases.

  • Amino acid - SLC - DNA codons:
  • Isoleucine - I - ATT, ATC, ATA
  • Leucine - L - CTT, CTC, CTA, CTG, TTA, TTG
  • Valine - V - GTT, GTC, GTA, GTG
  • Phenylalanine - F - TTT, TTC
  • Methionine - M - ATG
  • Cysteine - C - TGT, TGC
  • Alanine - A - GCT, GCC, GCA, GCG
  • Glycine - G - GGT, GGC, GGA, GGG
  • Proline - P - CCT, CCC, CCA, CCG
  • Threonine - T - ACT, ACC, ACA, ACG
  • Serine - S - TCT, TCC, TCA, TCG, AGT, AGC
  • Tyrosine - Y - TAT, TAC
  • Tryptophan - W - TGG
  • Glutamine - Q - CAA, CAG
  • Asparagine - N - AAT, AAC
  • Histidine - H - CAT, CAC
  • Glutamic acid - E - GAA, GAG
  • Aspartic acid - D - GAT, GAC
  • Lysine - K - AAA, AAG
  • Arginine - R - CGT, CGC, CGA, CGG, AGA, AGG
  • Stop - TAA, TAG, TGA
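
To make the redundancy concrete, here is a minimal Python sketch that builds a codon lookup from the table and shows two different DNA strings giving the same amino acids. The names AA_TO_CODONS, CODON_TO_AA and translate are made up for illustration, "*" is an assumed marker for stop, and leucine is given all six codons of the standard genetic code (including CTA). The two DNA strings in the usage lines are invented examples, not the real hemoglobin gene.

```python
# Amino acid (SLC) -> DNA codons, following the standard genetic code.
# "*" is an assumed marker for the three stop codons.
AA_TO_CODONS = {
    "I": "ATT ATC ATA", "L": "CTT CTC CTA CTG TTA TTG", "V": "GTT GTC GTA GTG",
    "F": "TTT TTC", "M": "ATG", "C": "TGT TGC", "A": "GCT GCC GCA GCG",
    "G": "GGT GGC GGA GGG", "P": "CCT CCC CCA CCG", "T": "ACT ACC ACA ACG",
    "S": "TCT TCC TCA TCG AGT AGC", "Y": "TAT TAC", "W": "TGG",
    "Q": "CAA CAG", "N": "AAT AAC", "H": "CAT CAC", "E": "GAA GAG",
    "D": "GAT GAC", "K": "AAA AAG", "R": "CGT CGC CGA CGG AGA AGG",
    "*": "TAA TAG TGA",
}

# Invert the table: codon -> amino acid.
CODON_TO_AA = {
    codon: aa for aa, codons in AA_TO_CODONS.items() for codon in codons.split()
}

def translate(dna):
    """Translate DNA codon by codon, stopping at the first stop codon."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        aa = CODON_TO_AA[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Two different base strings, same three amino acids (M, V, L).
print(translate("ATGGTTCTT"))
print(translate("ATGGTGTTA"))
```

Going the other way (protein –> DNA) is not unique, so any sonification that works on bases has to pick one codon per amino acid or start from an actual gene sequence.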

The question is, how do we translate a sequence into some sort of audio output? Let's try to start with some trivial exercises, and work up to more complex (and hopefully meaningful/interesting/musical) audio representations.

  1. Each base is translated into a tone (arpeggiated chord)
  2. Each base is randomly translated into one of two notes (simple melody)
  • Example:
  • A: C, D
  • T: E, F
  • C: G, A
  • G: B, C(+1 octave)
  3. Each base is randomly translated into one of two notes, and notes are sustained so that they build up codons (in a sonata-type effect)
  • Note 1 lasts 3 beats
  • Note 2 lasts 2 beats
  • Note 3 lasts 1 beat
  4. Amino acids are translated into a 20-tone scale (sequence –> melody)
  5. Amino acids are categorized into elements of a song by side chain characteristics/other chemical properties
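
The two "random note" exercises could be sketched roughly as follows. This is one possible reading, not a settled design: it assumes the example notes sit in the octave starting at middle C (MIDI note 60), and that "sustained" means each note of a codon starts one beat after the previous one so that all three end together. BASE_TO_NOTES, CODON_BEATS, bases_to_melody and midi_to_hz are hypothetical names.

```python
import random

# Example mapping from the list, as MIDI note numbers (assuming C = 60).
BASE_TO_NOTES = {"A": (60, 62), "T": (64, 65), "C": (67, 69), "G": (71, 72)}
CODON_BEATS = (3, 2, 1)  # durations of notes 1, 2 and 3 within a codon

def bases_to_melody(dna, seed=None):
    """Return (start_beat, midi_note, beats) for each base.

    Each base picks one of its two notes at random. Note i starts on
    beat i, so the three notes of a codon overlap and end together.
    """
    rng = random.Random(seed)  # a seed makes the melody reproducible
    melody = []
    for i, base in enumerate(dna.upper()):
        note = rng.choice(BASE_TO_NOTES[base])
        melody.append((i, note, CODON_BEATS[i % 3]))
    return melody

def midi_to_hz(note):
    """Equal-tempered frequency of a MIDI note (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)

for start, note, beats in bases_to_melody("ATGGTT", seed=1):
    print(start, note, beats, round(midi_to_hz(note), 1))
```

The resulting list of events could then be handed to ChucK, SuperCollider or any MIDI writer; that step is left out here.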

Amino acid classes –> timbres, and each AA within a class is a different pitch

  1. Aliphatic (G, A, V, L, I) - sine
  2. Hydroxyl/Sulfur-containing (S, C, T, M) - triangle
  3. Cyclic (P) - resonant filtered noise
  4. Aromatic (F, Y, W) - rectangle
  5. Basic (H, K, R) - high-pass filtered noise
  6. Acidic & Amide (D, E, N, Q) - saw
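
A rough sketch of the class –> timbre idea in plain Python, in case it helps before moving to a real synthesis engine. Only the four periodic waveforms are generated; the two filtered-noise timbres are kept as labels because the filter design is still open. Using the position of a residue inside its class as a pitch step is an assumption, and all names (CLASSES, AA_TO_TIMBRE, AA_TO_STEP, oscillator, render) are hypothetical.

```python
import math

# Class -> (members, timbre), as listed above.
CLASSES = {
    "aliphatic": ("GAVLI", "sine"),
    "hydroxyl/sulfur": ("SCTM", "triangle"),
    "cyclic": ("P", "resonant noise"),
    "aromatic": ("FYW", "rectangle"),
    "basic": ("HKR", "high-pass noise"),
    "acidic/amide": ("DENQ", "saw"),
}
AA_TO_TIMBRE = {aa: t for members, t in CLASSES.values() for aa in members}
# Assumed pitch rule: position of the residue within its class.
AA_TO_STEP = {aa: i for members, _ in CLASSES.values()
              for i, aa in enumerate(members)}

def oscillator(timbre, phase):
    """One sample in [-1, 1] of a periodic waveform; phase is in cycles."""
    p = phase % 1.0
    if timbre == "sine":
        return math.sin(2 * math.pi * p)
    if timbre == "triangle":
        return 4 * abs(p - 0.5) - 1  # starts at +1, reaches -1 at mid-cycle
    if timbre == "rectangle":
        return 1.0 if p < 0.5 else -1.0
    if timbre == "saw":
        return 2 * p - 1
    raise ValueError(f"no periodic oscillator for {timbre!r}")

def render(aa, freq_hz, seconds=0.1, rate=44100):
    """Samples for one residue at a given frequency (periodic timbres only)."""
    timbre = AA_TO_TIMBRE[aa.upper()]
    n_samples = int(seconds * rate)
    return [oscillator(timbre, freq_hz * n / rate) for n in range(n_samples)]

samples = render("W", 220.0)  # tryptophan: aromatic, rectangle wave
print(AA_TO_TIMBRE["W"], AA_TO_STEP["W"], len(samples))
```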

Side-chain polarity –> rhythmic gating

  1. Polar (N, Q, Y, S, T, R, K, H, D, E)
  2. Nonpolar (P, W, G, A, M, C, F, L, V, I)
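
One simple way to try the gating idea: turn the sequence into an on/off pattern that a synth can use to open and close a gate. Which class opens the gate is not decided above, so treating polar as "open" is purely an assumption, and POLAR, NONPOLAR and gate_pattern are hypothetical names.

```python
POLAR = set("NQYSTRKHDE")
NONPOLAR = set("PWGAMCFLVI")

def gate_pattern(protein):
    """Return 1 for each polar residue (gate open) and 0 for each nonpolar one."""
    pattern = []
    for aa in protein.upper():
        if aa in POLAR:
            pattern.append(1)
        elif aa in NONPOLAR:
            pattern.append(0)
        else:
            raise ValueError(f"unknown amino acid code: {aa!r}")
    return pattern

# First eight residues of hemoglobin subunit alpha.
print(gate_pattern("mvlspadk"))
```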

Side-chain charge –> panning

  1. Negative - Left (D, E)
  2. Neutral - Center (N, Q, P, Y, W, S, T, G, A, M, C, F, L, V, I, H*)
  3. Positive - Right (R, K)

*H is neutral 90% of the time, positive 10% of the time
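
The panning rule, including the 90/10 split for histidine, might be sketched like this. The pan range of -1 (left) to +1 (right) and the constant-power gain law are assumptions picked for illustration, as are the names pan_position and pan_gains.

```python
import math
import random

NEGATIVE = set("DE")
POSITIVE = set("RK")

def pan_position(aa, rng=random):
    """Pan in [-1, 1]: -1 = left (negative), 0 = center, +1 = right (positive)."""
    aa = aa.upper()
    if aa in NEGATIVE:
        return -1.0
    if aa in POSITIVE:
        return 1.0
    # Histidine: positive about 10% of the time, neutral otherwise.
    if aa == "H" and rng.random() < 0.1:
        return 1.0
    return 0.0  # every other residue is treated as neutral

def pan_gains(pan):
    """Constant-power (left, right) gains for a pan position in [-1, 1]."""
    theta = (pan + 1) * math.pi / 4  # 0 = hard left, pi/2 = hard right
    return math.cos(theta), math.sin(theta)

for aa in "DGHK":
    left, right = pan_gains(pan_position(aa))
    print(aa, round(left, 3), round(right, 3))
```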

Hydropathy index –> amplitude, ordered from most negative (hydrophilic) to most positive (hydrophobic): R, K, D, E, N, Q, H, P, Y, W, S, T, G, A, M, C, F, L, V, I
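
For the amplitude mapping, a sketch along these lines could work. It assumes the ordering above is the Kyte-Doolittle hydropathy scale (the values below are from that scale and reproduce the same order) and scales linearly so that the most hydrophilic residue is quiet but not silent. KYTE_DOOLITTLE, amplitude and the floor parameter are hypothetical names.

```python
# Kyte-Doolittle hydropathy values (assumed to be the index meant above),
# listed from most negative to most positive.
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "E": -3.5, "N": -3.5, "Q": -3.5,
    "H": -3.2, "P": -1.6, "Y": -1.3, "W": -0.9, "S": -0.8, "T": -0.7,
    "G": -0.4, "A": 1.8, "M": 1.9, "C": 2.5, "F": 2.8, "L": 3.8,
    "V": 4.2, "I": 4.5,
}

def amplitude(aa, floor=0.1):
    """Linear map from hydropathy [-4.5, 4.5] to amplitude [floor, 1.0]."""
    h = KYTE_DOOLITTLE[aa.upper()]
    return floor + (1.0 - floor) * (h + 4.5) / 9.0

# First eight residues of hemoglobin subunit alpha.
print([round(amplitude(aa), 2) for aa in "mvlspadk"])
```

A logarithmic (decibel) mapping might sound more even than a linear one; that would be worth comparing by ear.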


Hemoglobin subunit alpha (Homo sapiens):

mvlspadktnvkaawgkvgahageygaealermflsfpttktyfphfdlshgsaqvkghgkkvadaltnavahvddmpnalsalsdlhahklrvdpvnfkllshcllvtlaahlpaeftpavhasldkflasvstvltskyr

Hemoglobin subunit beta (Homo sapiens):

mvhltpeeksavtalwgkvnvdevggealgrllvvypwtqrffesfgdlstpdavmgnpkvkahgkkvlgafsdglahldnlkgtfatlselhcdklhvdpenfrllgnvlvcvlahhfgkeftppvqaayqkvvagvanalahkyh

RuBisCO small subunit (Plocamium serrulatum):

mritqgtfsflpdltdeqikkqveyaiskkwsvgieytedphprnsywel

RuBisCO large subunit (Plocamium serrulatum):

alfrvtpqpgvdpieasaavagesstatwtvvwtdlltacdlyrakaykvdavpntpdqyfafvaydidlfeegsipnltasiignvfgfkavkalrledmripvaylktfqgpatgiiverermdkfgrpflgatvkpklglsgknygrvvyeglkggldflkddeninsqpfmrwkerylysmegvnraiaasgevkghylnvtcatieemyeraefakqlgsiiimidlvigytaiqtmaiwarrndmilhlhragnstysrqkihgmnfrvickwmrmsgvdhihagtvvgklegdplmirgfyntlllthlsvnlpqgiffeqdwaslrkvtpvasggihcgqmhqlldylgddvvlqfgggtighpdgiqagatanrvaleamvlarnegrdyvnegpqilqdaakncgplqtaldlwkdisfnytstdtadfvdtptsnv

P53 wild type (Homo sapiens):

meepqsdpsvepplsqetfsdlwkllpennvlsplpsqamddlmlspddieqwftedpgpdeaprmpeaaprvapapaaptpaapapapswplsssvpsqktyqgsygfrlgflhsgtaksvtctyspalnkmfcqlaktcpvqlwvdstpppgtrvramaiykqsqhmtevvrrcphhercsdsdglappqhlirvegnlrveylddrntfrhsvvvpyeppevgsdcttihynymcnsscmggmnrrpiltiitledssgnllgrnsfevhvcacpgrdrrteeenlrkkgephhelppgstkralsnntssspqpkkkpldgeyftlqirgrerfemfrelnealelkdaqagkepggsrahsshlkskkgqstsrhkklmfktegpdsd

P53 isoform b (Homo sapiens):

meepqsdpsvepplsqetfsdlwkllpennvlsplpsqamddlmlspddieqwftedpgpdeaprmpeaappvapapaaptpaapapapswplsssvpsqktyqgsygfrlgflhsgtaksvtctyspalnkmfcqlaktcpvqlwvdstpppgtrvramaiykqsqhmtevvrrcphhercsdsdglappqhlirvegnlrveylddrntfrhsvvvpyeppevgsdcttihynymcnsscmggmnrrpiltiitledssgnllgrnsfevrvcacpgrdrrteeenlrkkgephhelppgstkralpnntssspqpkkkpldgeyftlqdqtsfqkenc

P53 isoform c (Homo sapiens):

meepqsdpsvepplsqetfsdlwkllpennvlsplpsqamddlmlspddieqwftedpgpdeaprmpeaappvapapaaptpaapapapswplsssvpsqktyqgsygfrlgflhsgtaksvtctyspalnkmfcqlaktcpvqlwvdstpppgtrvramaiykqsqhmtevvrrcphhercsdsdglappqhlirvegnlrveylddrntfrhsvvvpyeppevgsdcttihynymcnsscmggmnrrpiltiitledssgnllgrnsfevrvcacpgrdrrteeenlrkkgephhelppgstkralpnntssspqpkkkpldgeyftlqmlldlrwcyflinss

Insulin (Homo sapiens):

malwmrllpllallalwgpdpaaafvnqhlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn

Insulin (Octodon degus):

mapwmhlltvlallalwgpnsvqayssqhlcgsnlvealymtcgrsgfyrphdrreledlqveqaelgleagglqpsalemilqkrgivdqccnnictfnqlqnycnvp

Insulin (Ovis aries):

malwtrlvpllallalwapapahafvnqhlcgshlvealylvcgergffytpkarrevegpqvgalelaggpgagglegppqkrgiveqccagvcslyqleycn

Insulin (Oryctolagus cuniculus):

maslaallpllallvlcrldpaqafvnqhlcgshlvealylvcgergffytpksrreveelqvgqaelgggpgagglqpsalelalqkrgiveqcctsicslyqlenycn

proteinsequence.txt · Last modified: 2013/09/29 12:49 by epiette