Two days ago, my friend Olivier Vallée sent me an article (nytimes) about the emerging role of software tools in reducing the cost of lawyers’ work by analyzing the contents of documents. This reminded me of the work Olivier and I did some years ago on using a nucleotidic approach to analyze textual data.
So here I am again, continuing the subject of the earlier post!
In part I, I presented a first view of the analogies between molecular biology elements and other real-world problems, and the probable benefit of using molecular biology’s structures and behaviors in data modeling and software solution architecture.
In this part, I will try to present a simple and direct application of this approach: language modeling and analysis. There are several advantages to starting with textual analysis as a first application:
· First, the structure of a textual object can be mapped easily, and fairly directly, to molecular biology elements (nucleotides, DNA strands, amino acids…);
· Second, textual analysis is becoming (once again!) a must-know, and probably lucrative :), technology (see the nytimes article);
Let us first try to analyze the physical (organic) structure of a textual object:
· Characters;
· Composing words;
· Composing phrases;
· Composing paragraphs;
· … etc;
This same physical structure can be interpreted according to other (different / parallel) logical components:
· Symbols (⇔ characters);
· Composing phonemes (⇔ words);
· Composing expressions;
· Composing sense (social / intellectual orientation (or mood) of the text)
Again, this parallel representation of the same object (encountered in so many data modeling problems) is one of the important analogies with molecular biology elements (similar, for example, to the mapping of nucleotide sequences to amino acids).
To illustrate another aspect of this analogy, let us take a look at the characters / words ⇔ characters / phonemes parallel. To correctly read a phoneme, we need to locate its start and end characters, which may overlap with, or be part of, the ‘words’ sequence of our text. In some way, we need to retrieve the reading frame (or phasing) of a sequence of phonemes through our nucleotidic character structure. This phasing is independent of the characters / words representation of the sequence, which again brings to mind the question of the reading frames of an amino acid sequence contained in a nucleotidic sequence.
In the case of our text sequence, characters, words, phonemes, etc. are, of course, language-dependent. However, for the sake of simplicity, let us stay in the context of the English language.
Let us now take an example of how we can read a ‘mood’ sequence inside a physical text sequence (characters, words…).
The mood (social sense) of a text can be detected through the interpretation (or translation) of sequence(s) of specific (predefined) expressions.
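As a rough illustration of this phasing idea, here is a minimal Python sketch. It is only a toy under stated assumptions: the `GRAPHEMES` table, the function names and the greedy two-letter reading rule are all hypothetical choices made for this example, not a real phonetic model of English. It reads the same character sequence twice, once as words and once as phoneme-like units, and lets the phoneme reading start at an arbitrary offset (its reading frame).

```python
# Hypothetical table of two-letter units read as a single sound.
# The entries are illustrative assumptions, not a real phoneme inventory.
GRAPHEMES = {"th", "sh", "ch", "ee", "oo"}


def word_spans(text):
    """Return (start, end) character spans of whitespace-separated words."""
    spans, start = [], None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans


def phoneme_spans(text, start=0):
    """Greedy left-to-right read beginning at offset `start` (the frame).

    At each letter, a two-letter unit from GRAPHEMES is preferred;
    otherwise a single letter is read. Non-letters are skipped.
    """
    spans, i = [], start
    while i < len(text):
        if not text[i].isalpha():
            i += 1
        elif text[i:i + 2].lower() in GRAPHEMES:
            spans.append((i, i + 2))
            i += 2
        else:
            spans.append((i, i + 1))
            i += 1
    return spans


text = "the sheep"
words = word_spans(text)
frame0 = phoneme_spans(text, start=0)
frame1 = phoneme_spans(text, start=1)
```

Both readings index into the same characters, but their boundaries differ, and shifting the start offset by one changes what is read: in `frame1` the unit "th" is no longer seen as a single sound.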
“That’s great” can be interpreted differently from “hey… fantastic”
Or “Hello”, differently from “Hi”…
Or “:(” differently from “:)”
The social expression of a text sequence is also, in some way, related to its internal phoneme sequence. The phonemes of a text actually produce a sequence of sounds that gives the text a particular internal ‘music’, which in the end helps transmit the specific social sense of the initial text sequence. That is probably an important aspect of poetry (?)
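The translation of predefined expressions into a mood can be sketched in the same spirit as a codon table. The Python snippet below is a hedged toy, not a real sentiment analyzer: the `MOOD_TABLE` entries, the mood labels, the function names and the longest-match rule are all assumptions introduced for illustration, and emoticons are written in ASCII as `:)` and `:(`.

```python
# Hypothetical expression -> mood table; every entry and label is an
# illustrative assumption, not a validated lexicon.
MOOD_TABLE = {
    "that's great": "positive-formal",
    "fantastic": "positive-casual",
    "hello": "greeting-formal",
    "hi": "greeting-casual",
    ":)": "happy",
    ":(": "sad",
}


def _on_boundary(s, start, end):
    """Reject matches that begin or end in the middle of a word."""
    before_ok = start == 0 or not (s[start].isalnum() and s[start - 1].isalnum())
    after_ok = end == len(s) or not (s[end - 1].isalnum() and s[end].isalnum())
    return before_ok and after_ok


def read_moods(text):
    """Scan left to right and translate known expressions into moods.

    Returns a list of (position, expression, mood) tuples. At each
    position the longest matching expression is tried first.
    """
    s = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    keys = sorted(MOOD_TABLE, key=len, reverse=True)
    moods, i = [], 0
    while i < len(s):
        for key in keys:
            if s.startswith(key, i) and _on_boundary(s, i, i + len(key)):
                moods.append((i, key, MOOD_TABLE[key]))
                i += len(key)
                break
        else:
            i += 1  # no expression starts here; move one character on
    return moods


mood_sequence = [mood for _, _, mood in read_moods("Hi, that's great :)")]
```

The result is a ‘mood’ sequence read out of the physical character sequence, with its own boundaries; the boundary check keeps "hi" from being read inside a word such as "this".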
I will provide a practical coded example in a future post.