Taoffi's blog

prisonniers du temps

covid-5ync – coronavirus curiosities!

Just noticed this while working and debugging covid5ync app. Don't know if that may have sense from a biotechnology view... still, it worth to be mentioned:

  • A fragment (15 nucleotides) of the RNA sequence looks to be a 'palindrome' -:) of itself… Position 22901 -22916
    Note: the top part of the figures below reads sequence's nucleotides in the direction 5'->3', the lower should be read in the reverse direction 3'->5' (which is the complementary of the upper part (i.e. 'a' complementary to 't' and 'g' to 'c'))…

  • A similar 'palindrome' fragment (15 nucleotides): 6089 - 6104

  • Similar with a small shift: (5744 – 5764)

covid-5ync – the fight against the crowned virus!

I continue posting about covid-5ync app (a helper app contribution in the fight against the latest coronavirus epidemic)

A major step now done: serializing application's information into xml files!

Hopefully this will allow transmitting research session work among members of our target community: biotechnology engineers.

I also downloaded (@NCBI site) a recent version of the virus RNA sequence (referred to as: MT163719 29903bp RNA linear VRL 10-MAR-2020). A daily work is done by the application to retrieve significant regions on the sequence. The updated version of the (xml) file is then uploaded to the project's web site which will allow gaining time in some tedious analysis tasks.

Serialization / deserialization

Reminder: Technically speaking, the sequence's data structures can be summarized as:

  • Sequence: (that is a collection of nodes, each referring to an item of collection of bases)
    • Identification and info of the sequence (Id, Name, Summary…)
    • A collection of named-regions (identified by start / end index + name and summary)
    • A collection of sub-sequences of 'repeats' occurrences
    • A collection of sub-sequences of hairpins (or 'pair-repeats') occurrences

 

To serialize these information… (in regards of the urgency matters) I decided to simply use a DataContractSerializer (which, as I said in other posts, is not really the best solution for extensibility).

The encountered difficulty was on several aspects:

  • Some information is redundant, notably for sub-sequences (collections of repeats and hairpins)
  • The sequence object being itself a collection of nodes, the serializer did not really allow an easy way to handle its sub-objects. Here we fall into a known problem of cycling!

 

The solution used was:

  • Transform sub-sequences to region lists (start/end indexes + name + occurrences) and deserialize back to sub-sequences
  • Create a serialization-specialized object (using DataContract / DataMember attributes) and process the serialization through that object to avoid the 'cycling' errors.

 

The second difficulty was errors encountered on deserialization while locating and assigning various regions start/end indexes to the sequence's nodes.

Actually, the deserialization submits the sequence's node as a string to be parsed asynchrony. And as you may have guessed, the collection of sequence's nodes was being altered while locating the regions' nodes!

The solution:

  • The Parse method raises Parse Complete event
  • Subscribe to that event and proceed to these operations.

 

That is the short story… will keep you posted about more details later this week!

KEEP SAFE: wash your hands + do not touch your face + keep hope: Humanity will prevail!

covid-5ync – dna search using Linq

This post, follows in the series about technical details of covid-5ync (an open source application contribution in the fight against the new covid-19 virus).

This time, our talk is about searching nucleotides within a dna sequence.

Search is an important task for analyzing a sequence. Actually, to understand the structure of a genome, one task is to locate and classify identical regions and complementary regions (complementary regions = where nucleotides are paired: aót, góc). From the IT point of view, such a task is obviously related to 'search'… and should particularly be optimized.

Using Linq

To locate a sequence of nucleotides (i.e. a string) within a (global) sequence, we may simply iterate into the global sequence nodes (each of which is a char) until we find the searched string. To find all occurrences, we can repeat the process starting at the next location and so on.

Despite the artisanal aspect of the process, that works well for searching a predefined string.

Another task that looks related to search, is to find 'repeats' (Repeats are identical regions on a sequence).

It is somehow different from searching a predefined string. The application actually has to iterate through the sequence, and at each position take a number of nucleotides to compose a string to be searched all over the sequence… (and keep coordinates of the found occurrences).

Proposed solution

In covid-5ync, a dna sequence is a List<dna node> where each node contains an Index (int). Using Linq, we can define a method that returns the string of a given length at a given location (node Index) of the sequence:

public string StringAtIndex(int index, int len)
{
    string str = "";
    var nodes = this.SkipWhile(i => i.Index < index).Take(len);

    foreach(var n in nodes)
        str += n.Code.ToString();

    return str;
}

 

With this method in place, we can efficiently locate all starting nodes of occurrences of a string on the sequence:

IEnumerable<iDnaNode> AllStartoccurrencesOfString(string str)
{
    int len    = str.Length;
    return this.Where(i => StringAtIndex(i.Index, len) == str);
}

 

Sample usage: locate and select occurrences of a string

// get all starting nodes of the string
Var allStarts = AllStartoccurrencesOfString(str).ToList();

// visit the found starting nodes and select (str.Length) consecutive nodes…
foreach( var item in allStarts)
{
    var subSeq = this.SkipWhile(n => n.Index <  item.Index).Take(str.Length);
    subSeq.SelectAllNodes();
}

 

covid-5ync – dna analysis helper application: progress!

A quick post to talk about the progress in this project with a few technical details.

First, a screen capture of what it looks like today (just a little better than before!):

 

The first version of the application is already online: click-once installation. And Yes, we don't have money to have a certificate for the deployment… install only if you trust!... any suggestions about sponsors are also greatly welcome!

 

A few features implemented:

  • Search for string and search for the 'complementary' of string.
  • Paging
  • Zoom-in / out on the sequence

Next steps:

  • Search enhancement: visually locate occurrences
  • Search for unique fragments according to user-provided settings
  • Open file / save selection

Long-term:

  • … endless

 

Now, let us take a quick dive into the mechanics

The first class diagram (updated source code with such documentation is @github)

That seems quite basic(!), actually, as the reality of DNA, and that is one reason it is also fascinating!

  • A sequence (iDnaSequence) is composed of 'nodes' (iDnaNode)
  • A node is noted as a char. It refers to a 'base'.
  • Commonly a set of 4 bases is used ('a', 't', 'g' and 'c'). There may be more, but we can easily handle this when needed.
  • Each individual base has a complementary 'Pair'. A is pairable with T, C with G
  • The 'standard' set of bases (iDnaBaseNucleotides) is a (singleton) list of the 4 bases. It sits as the main reference for nodes. It provides answers to important questions like: is 'c' a valid base?... what the complementary of 'X'?... what is the complementary sequence of the string "atgccca"? and so on.

 

Visual presentation: a start point

There are many ways to present a DNA sequence. To start with something, let us assign a color to each base. The user can later change this to obtain a view according to his or her work area. Technically speaking, we have some constraints:

  • The number of nucleotides of a sequence can be important. For coronavirus, that is roughly 29000. We therefor need 'paging' to display and interact with a sequence.
  • Using identification colors for nucleotides can also help to visually identify meaningful regions of the sequence on hand. For this to be useful, we need to implement zoom-in/out on the displayed sequence.

 

Paging

I simply used the solution exposed in a previous post about doc5ync.

 

Zoom-in / out

I found a good solution through a discussion on Stack Overflow
(credit: https://stackoverflow.com/users/967254/almulo).

  • Find the scroll viewer of the ListView.
  • Handle zoom events as required
var presenter        = UiHelpers.FindChild<ScrollContentPresenter>(listItems, null);
var mouseWheelZoom = new MouseWheelZoom(presenter);
PreviewMouseWheel += mouseWheelZoom.Zoom;

 

Sample screen shots of a zoom-out / in

More details later in a future post.

Please send your remarks, suggestions and contributions on github.

Hope all that will be useful in some way… Time is running… Humanity will prevail

covid-5ync – tackling covid-19

NCBI (National Center for Biotechnology Information) offers a vast database of DNA sequences.

I visited their site to see if I can find information about the last version of the menacing coronavirus.

Yes, that is available. It is precisely named: coronavirus 2 (SARS-CoV-2), and a long list of its DNA sequences is there.

I had worked, long years ago, on DNA sequences analysis and felt like giving a try to see how that sequence can be presented… just to see!

My old app (MFC, C++ app) could open and analyze the sequence. DNA sequence analysis is a long story. As far as I could learn: locate repeats (fragments that are repeated on the sequence), locate 'hairpins' (fragments of complementary nucleotides: a<=>T and g<=>c)… etc.

The old app did not seem quite handy to manipulate the downloaded sequence, so I started writing a new WPF one.

A few hours later, the app could display the sequence in a somehow 'visual appealing' UI, which invited to go ahead for some more significant work.

Covid-19 is not for fun!

Yes, it is not really for fun! I am not yet sure how such work can be useful, but whatever effort everyone can provide might be of help in defeating this new danger. Let us start and see!

For now, what I intend to do is:

  • Port the biotechnology features of the old app to a new handy UI;
  • Publish the app online for biotechnology engineers working on the subject: and get their feedback
  • Upload the source code to github for IT community feedback and contributions

It is a very small step in a long way to defeat that epidemic.

More on this in the next few days / weeks.

Be safe!

Learning from Nature: Towards a Nucleotidic modeling - part II

Two days ago, my friend Olivier Vallée sent me an article (nytimes) talking about the emerging role of software tools in reducing costs of lawyers’ work by analyzing documents contents. This reminded me the work we did, Olivier and I, some years ago around the use of a nucleotidic approach for analyzing textual data.

 

So, here again to continue the subject of the earlier post!

 

In part I, I presented a first view of the analogies between molecular biology elements and other real world problematic. And a probable benefit of using molecular biology’s structures and be behaviors to in data modeling and software solutions architecture.

In this part, I will try to expose a simple and direct application of this approach: language modeling and analysis. There are several advantages for starting with textual analysis as a first application:

·         First, a textual object structure can easily, and somehow directly, be mapped to molecular biology elements (nucleotides, DNA strands, amino acids…);

·         Second, the textual analysis is becoming (re-becoming!) a must know (and probably lucrative as well J) technology  (see nytimes article);

Let us first try to analyze the physical (organic) structure of a textual object:

·         Characters;

·         Composing Words;

·         Composing phrases;

·         Composing paragraphs;

·         … etc;

 

This same physical structure can be interpreted according to other (different / parallel) logical  components:

·         Symbols (óCharacters);

·         Composing phonemes (ówords);

·         Composing expressions;

·         Composing sense (social / intellectual orientation (or mood) of the text)

 

Again, this parallel presentation of a same object (encountered in so many data modeling problematics) is one of the important analogies with molecular biology elements (i.e. similar, for example, to nucleotide sequences mapping to amino acids).

To illustrate another aspect of this analogy, let us take a look at the characters / words ó characters / phonemes parallel. To correctly read a phoneme, we need to locate its start-end characters. Which may overlap or be part of the ‘Words’ sequence of our text. In some way, we need to retrieve the reading frame (or phasing) of a sequence of phonemes through our nucleotidic characters structure. This phasing is independent of the characters/words sequence presentation. Which, again, brings to our mind the question of reading frames of an amino acid sequence contained in a nucleotidic sequence.

 

In the case of our text sequence question, characters, words, phonemes… etc. are, of course, language-dependent. However, for sake of simplicity, let us stay in the context of the English language.

 

Let us now take an example of how we can read a ‘mood’ sequence inside a physical text sequence (characters, words…) sequence.

The mood (social sense) of a text can be detected through the interpretation (or translation) of specific (predefined) expressions sequence(s).

For example:

“That’s great” can be interpreted differently from “hey… fantastic”

Or “Hello”, differently from “Hi”…

Or “L” differently from “J

… etc.

 

Social expression of a text sequence is also, in some way, related to its internal phoneme sequence. Text phonemes actually produce a sequence of sounds that give a particular internal ‘music’ to the text. Which participates, in the end, in transmitting the specific social sense of the initial text sequence. That is probably an important aspect of poetry (?)

 

I will provide a practical coded example in a future post.

Learning from Nature: Towards a 'Nucleotidic' modeling - part I

 

 

 

During several years, I worked on a DNA sequence-analysis software project where I learned about DNA structures and several DNA engineering techniques.

(That is not to be confused with what is commonly called ‘genetic algorithms’ and ‘genetic programming’).

 

Elements and structures used (and manipulated) in molecular biology engineering sphere are fascinating and, above all, source of interesting knowledge. In some way, they define the representation of ‘Life hood’ structures and mechanisms for every living creature (plant, animal, bacteria, virus…)

Molecular biology teaches us that ‘Life’ is based on few nucleotides: A, T, G and C.

Like in music partitions, sequences composed of these few number of nucleotides can give unlimited number of combinations each representing the structure of a specific biological function and, in the end, a specific individual being.

 

Here are some sample (random) fragments of DNA sequences:

 

3’à…GCAATGGTTTCTTACTGTGGAGGACATAAAAATACAGCAAGGGT…à5’

3’à…TTGTGTTGCGAGATGTCGTCGTTAAAGACACTCTTTCTCCCGGAGTCAC…à5’

3’à…CTCTACAGTATAAAGTTCTAGTAAGAGCACAAACTCCTGGACAATTC…à5’

 

Note that DNA sequences are read in a specified direction (noted 3’à5’ in the above sequences).

Nucleotides are considered in complementary (attracted) pairs. A is complementary to T, and G to C.

Each DNA sequence strand has a ‘complementary’ sequence strand where complementary nucleotide pairs are arranged face-to-face in reversed direction to compose part of the well-known helicoidal presentation.

 

Example:

The complementary strand for

3’à…GCAATGGTTTCTTà5’

Is

5’ß…CGTTACCAAAGAAß3’

 

A DNA sequence has its mapped amino acid (protein) sequence. In this mapping, different combinations of three nucleotides (A, T, G or C) compose codons, each mapped to one amino acid (one amino acid may have several codon mappings… see below).

Here are some codon/amino acid mappings:

 

Amino acid

Amino acid Symbol

Codon

Alanine

A

GCA

GCC

GCG

GCT

Arginine

R

AGA

AGG

CGA

CGC

CGG

CGT

Asparagine

N

AAC

AAT

Methionine

M

ATG

 

A sequence may have several “reading frames” in which amino acid codons mapping interpretation may vary. Essentially, to read a sequence, you must first find a starting codon… otherwise; your sequence is the representation of nothing in biological life (real world!)

 

Another interesting aspect is that a specific DNA sequence (with ‘significant’ length) seems to represent one part of a global ‘predefined’ structure… that is, in a way, similar to a known melody in music. If you start to play the first specific notes of, say, a Beatles’ song, everyone can tell the rest of the melody. If your first notes are not specific enough, the result may lead to so many different melodies.

This phenomenon is used in PCR (Polymerase Chain Reaction) in molecular biology engineering to amplify or clone DNA sequences using partial sequences (rigorously selected!).        

 

Useful lessons for software design

Many (if not all J) real world problems seem to follow DNA sequences structure and representations:

§  Basic elements (A, T, G, C)

§  Related each to one another (AT, GC)

§  Composing a global sequence (unique for a specific area)

§  The sequence can be presented differently (DNA /Amino acid)

§  The translation between representations is done through specific mappings (codons / amino acids)

 

PCR technique also seems of great interest to some software areas (OCR recognition for example)

 

I will continue, in future posts, to explore these fascinating structures and propose some applications that mimic and benefit from their behaviors.