Taoffi's blog | September 2010

During several years, I worked on a DNA sequence-analysis software project where I learned about DNA structures and several DNA engineering techniques.

(That is not to be confused with what is commonly called ‘genetic algorithms’ and ‘genetic programming’).

Elements and structures used (and manipulated) in molecular biology engineering sphere are fascinating and, above all, source of interesting knowledge. In some way, they define the representation of ‘Life hood’ structures and mechanisms for every living creature (plant, animal, bacteria, virus…)

Molecular biology teaches us that ‘Life’ is based on few nucleotides: A, T, G and C.

Like in music partitions, sequences composed of these few number of nucleotides can give unlimited number of combinations each representing the structure of a specific biological function and, in the end, a specific individual being.

Here are some sample (random) fragments of DNA sequences:

3’à…GCAATGGTTTCTTACTGTGGAGGACATAAAAATACAGCAAGGGT…à5’

3’à…TTGTGTTGCGAGATGTCGTCGTTAAAGACACTCTTTCTCCCGGAGTCAC…à5’

3’à…CTCTACAGTATAAAGTTCTAGTAAGAGCACAAACTCCTGGACAATTC…à5’

Note that DNA sequences are read in a specified direction (noted 3’à5’ in the above sequences).

Nucleotides are considered in complementary (attracted) pairs. A is complementary to T, and G to C.

Each DNA sequence strand has a ‘complementary’ sequence strand where complementary nucleotide pairs are arranged face-to-face in reversed direction to compose part of the well-known helicoidal presentation.

Example:

The complementary strand for

3’à…GCAATGGTTTCTTà5’

5’ß…CGTTACCAAAGAAß3’

A DNA sequence has its mapped amino acid (protein) sequence. In this mapping, different combinations of three nucleotides (A, T, G or C) compose codons, each mapped to one amino acid (one amino acid may have several codon mappings… see below).

Here are some codon/amino acid mappings:

Amino acid	Amino acid Symbol	Codon
Alanine	A	GCA GCC GCG GCT
Arginine	R	AGA AGG CGA CGC CGG CGT
Asparagine	N	AAC AAT
Methionine	M	ATG

A sequence may have several “reading frames” in which amino acid codons mapping interpretation may vary. Essentially, to read a sequence, you must first find a starting codon… otherwise; your sequence is the representation of nothing in biological life (real world!)

Another interesting aspect is that a specific DNA sequence (with ‘significant’ length) seems to represent one part of a global ‘predefined’ structure… that is, in a way, similar to a known melody in music. If you start to play the first specific notes of, say, a Beatles’ song, everyone can tell the rest of the melody. If your first notes are not specific enough, the result may lead to so many different melodies.

This phenomenon is used in PCR (Polymerase Chain Reaction) in molecular biology engineering to amplify or clone DNA sequences using partial sequences (rigorously selected!).

Useful lessons for software design

Many (if not all J) real world problems seem to follow DNA sequences structure and representations:

§ Basic elements (A, T, G, C)

§ Related each to one another (AT, GC)

§ Composing a global sequence (unique for a specific area)

§ The sequence can be presented differently (DNA /Amino acid)

§ The translation between representations is done through specific mappings (codons / amino acids)

PCR technique also seems of great interest to some software areas (OCR recognition for example)

I will continue, in future posts, to explore these fascinating structures and propose some applications that mimic and benefit from their behaviors.

Mon	Tue	Wed	Thu	Fri	Sat	Sun
23	24	25	26	27	28	1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31	1	2	3	4	5