Taoffi's blog

prisonniers du temps

covid-5ync – dna search using Linq

This post, follows in the series about technical details of covid-5ync (an open source application contribution in the fight against the new covid-19 virus).

This time, our talk is about searching nucleotides within a dna sequence.

Search is an important task for analyzing a sequence. Actually, to understand the structure of a genome, one task is to locate and classify identical regions and complementary regions (complementary regions = where nucleotides are paired: aót, góc). From the IT point of view, such a task is obviously related to 'search'… and should particularly be optimized.

Using Linq

To locate a sequence of nucleotides (i.e. a string) within a (global) sequence, we may simply iterate into the global sequence nodes (each of which is a char) until we find the searched string. To find all occurrences, we can repeat the process starting at the next location and so on.

Despite the artisanal aspect of the process, that works well for searching a predefined string.

Another task that looks related to search, is to find 'repeats' (Repeats are identical regions on a sequence).

It is somehow different from searching a predefined string. The application actually has to iterate through the sequence, and at each position take a number of nucleotides to compose a string to be searched all over the sequence… (and keep coordinates of the found occurrences).

Proposed solution

In covid-5ync, a dna sequence is a List<dna node> where each node contains an Index (int). Using Linq, we can define a method that returns the string of a given length at a given location (node Index) of the sequence:

public string StringAtIndex(int index, int len)
{
    string str = "";
    var nodes = this.SkipWhile(i => i.Index < index).Take(len);

    foreach(var n in nodes)
        str += n.Code.ToString();

    return str;
}

 

With this method in place, we can efficiently locate all starting nodes of occurrences of a string on the sequence:

IEnumerable<iDnaNode> AllStartoccurrencesOfString(string str)
{
    int len    = str.Length;
    return this.Where(i => StringAtIndex(i.Index, len) == str);
}

 

Sample usage: locate and select occurrences of a string

// get all starting nodes of the string
Var allStarts = AllStartoccurrencesOfString(str).ToList();

// visit the found starting nodes and select (str.Length) consecutive nodes…
foreach( var item in allStarts)
{
    var subSeq = this.SkipWhile(n => n.Index <  item.Index).Take(str.Length);
    subSeq.SelectAllNodes();
}