Taoffi's blog

prisonniers du temps

doc5ync–Trie database integration process

I continue here the excursion around using the Trie pattern and structures to index e-book words for the doc5ync project.

If you missed the beginning of the story, you can find it Here, Here and Here

The role of the client integration tool (a WPF app) is to pull the e-book information to be indexed from the database, index the words, and create the links between each word and its related e-books. This is driven by a few settings: the language to index, the minimum number of characters for a sequence to count as a ‘word’, etc.

trie-with-data-db-integration-process

The integration process flow is quite simple:

  • Once we are happy with the obtained results, we use the tool to push the trie to the database in a staging table.
  • A database stored procedure can then extract the staging data into the tables used for presenting the index on the project web page.

trie-web-page

The staging table has a few fields:

  • The word string
  • The related e-book ID (relationship => docs table (e-books))
  • The number of occurrences of the word
  • The timestamp of the last insertion

The only difficulty encountered was the number of records (often tens of thousands) to push to the staging table. The (artisanal!) solution was to concatenate the values of blocks of records into a single insert command (i.e. ‘insert into table(field1, field2, …) values (v1, v2, …), (v3, v4, …), …’ etc.). Sending 150 records per command seemed to be a sustainable choice.
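The batching idea can be sketched as follows. This is only an illustration, not the tool's actual code: the StagingBatcher class, the doc5_trie_staging table name and its column names are assumptions of mine, and a production version would rather use parameterized commands than string concatenation.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class StagingBatcher
{
    // Hypothetical staging table and columns, for illustration only.
    public const string InsertHeader = "insert into doc5_trie_staging(word, doc_id, occurrences) values ";

    // Yields one multi-row INSERT statement per block of 'batchSize' records.
    public static IEnumerable<string> BuildInsertCommands(
        IEnumerable<(string Word, int DocId, int Occurrences)> records, int batchSize = 150)
    {
        if (batchSize < 1)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        var rows = new List<string>(batchSize);

        foreach (var r in records)
        {
            // double any single quote so the word stays a valid SQL string literal
            string word = r.Word.Replace("'", "''");
            rows.Add(string.Format(CultureInfo.InvariantCulture,
                                   "('{0}', {1}, {2})", word, r.DocId, r.Occurrences));

            if (rows.Count == batchSize)
            {
                yield return InsertHeader + string.Join(", ", rows);
                rows.Clear();
            }
        }

        // flush the last, possibly incomplete, block
        if (rows.Count > 0)
            yield return InsertHeader + string.Join(", ", rows);
    }
}
```

With 301 records and the default block size of 150, this yields three commands: two full blocks and a last one carrying a single record.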

The staging table data is to be dispatched into two production tables:

  • doc5_trie_words:
    • word ID
    • language ID
    • word string
    • word’s number of occurrences
    • comments

 

  • doc5_trie_word_docs:
    • word ID (relationship => the above table)
    • e-book ID (relationship => docs (e-books) table)

 

Once the data is in the staging table, the work of the stored procedure is quite straightforward:

  • Delete the current contents of the words table (which cascade-deletes the word / doc reference records)
  • Import the staging word (strings and occurrences) records into doc5_trie_words
  • Import the related word / doc IDs into doc5_trie_word_docs.

Many words are common to several languages and e-books. Therefore assigning a language to a word makes no sense unless all its related documents are in one specific language. That is the additional and final task of the stored proc.
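The rule itself is simple enough to sketch. In the project it lives in the stored procedure; the C# below only illustrates the logic, and the WordLanguage class and its integer language IDs are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WordLanguage
{
    // Returns the language ID to assign to a word, or null when its documents
    // span several languages (or when the word has no document at all).
    public static int? Resolve(IEnumerable<int> docLanguageIds)
    {
        // two distinct values are enough to know the word is multilingual
        var distinct = docLanguageIds.Distinct().Take(2).ToList();
        return distinct.Count == 1 ? distinct[0] : (int?) null;
    }
}
```

A word whose documents all carry language 3 gets language 3; a word found in documents of languages 1 and 3 gets no language.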

Next step: the index web page presentation!

That will be the subject of the next post!

doc5ync Trie integration tool - UI Tag cloud, paging and navigation

In the integration client tool I talked about in the previous post, we need to display the list of words in a way similar to the tag clouds found in blog posts.

For this we will use a ListView with some customization (see Xaml code below).

Paging

A more important question is the number of items to show. As we, in most cases, have thousands of items to display, we need a paging mechanism.

A solution – a Linq extension - proposed by https://stackoverflow.com/users/69313/wim-haanstra was a good base for a generic paging module:

 

// credit: https://stackoverflow.com/users/69313/wim-haanstra
// usage: MyQuery.Page(pageNumber, pageSize)   (pageNumber is 1-based)
using System.Collections.Generic;
using System.Linq;

public static class LinqPaging
{
    // used by LINQ to SQL
    public static IQueryable<TSource> Page<TSource>(this IQueryable<TSource> source, int page, int pageSize)
    {
        return source.Skip((page - 1) * pageSize).Take(pageSize);
    }

    // used by LINQ
    public static IEnumerable<TSource> Page<TSource>(this IEnumerable<TSource> source, int page, int pageSize)
    {
        return source.Skip((page - 1) * pageSize).Take(pageSize);
    }
}

trie-with-data-paging-base

iObjectPaging is now a generic class accepting any collection of data to be paged through calls to the Linq extension.

Let us derive, from this base, two specific paging classes: one for our words and another for our documents (DataItems):

trie-with-data-paging-2

That is all we need for paging. Each list simply assigns its data to the corresponding paging object, and the UI elements are bound to the CurrentPageData collection. Next / Previous buttons allow navigating through the collection pages.
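As the paging classes are only shown as diagrams, here is a minimal sketch of what such a base class could look like. Only SourceCollection and CurrentPageData are names taken from this post; PagingSketch and all its other members are my assumptions, and a real view model class would also raise property change notifications for the bindings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the paging base class.
public class PagingSketch<T>
{
    private IList<T> _source = new List<T>();

    public PagingSketch(int pageSize)
    {
        if (pageSize < 1)
            throw new ArgumentOutOfRangeException(nameof(pageSize));
        PageSize = pageSize;
    }

    public int PageSize { get; }
    public int CurrentPage { get; private set; } = 1;

    public IEnumerable<T> SourceCollection
    {
        get { return _source; }
        set
        {
            _source = value?.ToList() ?? new List<T>();
            CurrentPage = 1;    // a new collection restarts at the first page
        }
    }

    // at least one (possibly empty) page, so that CurrentPage is always valid
    public int PageCount => Math.Max(1, (_source.Count + PageSize - 1) / PageSize);

    public bool CanGoNext => CurrentPage < PageCount;
    public bool CanGoPrevious => CurrentPage > 1;

    // same Skip / Take arithmetic as the Page() extension
    public IEnumerable<T> CurrentPageData => _source.Skip((CurrentPage - 1) * PageSize).Take(PageSize);

    public bool Next()
    {
        if (!CanGoNext) return false;
        CurrentPage++;
        return true;
    }

    public bool Previous()
    {
        if (!CanGoPrevious) return false;
        CurrentPage--;
        return true;
    }
}
```

CanGoNext and CanGoPrevious are the kind of flags the Next / Previous buttons can bind to for their enabled state.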

The main view model, for instance, declares a paging member:

protected iWordPaging	_wordPaging	= new iWordPaging(200);

And assigns its Words collection to this paging member whenever the collection changes:

_wordPaging.SourceCollection	= ItemList?.AllWords;

 

Words as Tag Cloud

A ListView control should be customized for this.

We need to customize its ItemsPanel and ItemTemplate:

<ListView x:Name="listItems" 
			Grid.Row="1" 
			ItemsSource="{Binding WordPaging.CurrentPageData, IsAsync=True}" 
			BorderBrush="#FFA3A3A4" BorderThickness="1"
			SelectedItem="{Binding SelectedWord, Mode=TwoWay, IsAsync=True}" Background="{x:Null}"
			Padding="12" ScrollViewer.HorizontalScrollBarVisibility="Disabled"
			>
   <ListView.ItemsPanel>
       <ItemsPanelTemplate>
          <WrapPanel MaxWidth="{Binding ElementName=listItems, Path=ActualWidth, Converter={StaticResource widthConverter}}" HorizontalAlignment="Left" Height="auto" Margin="12,0,12,0" />
       </ItemsPanelTemplate>
   </ListView.ItemsPanel>

   <ListView.ItemTemplate>
       <DataTemplate>
          <local:iWordCtrl DataContext="{Binding }" Width="120" Height="40" />
       </DataTemplate>
    </ListView.ItemTemplate>
  </ListView>
 

Data items grid

A DataGrid bound to the selected Word data items will display its (paged) data items:

<DataGrid ItemsSource="{Binding CurrentPageData, IsAsync=True}">
   <DataGrid.Columns>
    …
    …


 

With this in place, we are now able to:

  • Navigate through word pages
  • When the selected word changes, its related data (paged) items (e-books) are displayed in the DataGrid…
  • Next / Previous buttons can be used to navigate, and will be enabled or disabled according to the paging context (see paging base class in the diagram above)
  • A list of pages (combo box) also allows jumping to a specific page

 

trie-with-data-paged-word-cloud

 

Sample paged datagrid of e-books containing the selected word

trie-with-data-paged-datagrid

In the next post, we will see the database integration process.

doc5ync Trie index integration tool

This is a maintenance WPF client application for indexing the words found in e-book titles and descriptions for the doc5ync project (http://doc5.5ync.net/).

Before talking about technical details, let us start with a few representative screenshots of the app.

1. Scanning the words of all languages for 10000 data records, with a minimum word length of 4 chars.

trie-with-data-window1

After the scan, words are displayed (on the left side of the above figure) highlighting the occurrences of each word (greater font size = more occurrences). This is done by a user control which itself uses a converter.

trie-with-data-word-control

A simple Border enclosing a TextBlock

<UserControl.Resources>
	<conv:TrieWordFontSizeConverter x:Key="fontSizeConverter" />
</UserControl.Resources>
<Grid x:Name="grid_main">
	<Border BorderBrush="DarkGray" CornerRadius="2" Background="#FFEDF0ED" Height="auto" Margin="2" BorderThickness="1">
		<TextBlock Text="{Binding Word}"
					Padding="4px"
					VerticalAlignment="Center"
					HorizontalAlignment="Center"
					FontSize="{Binding ., Converter={StaticResource fontSizeConverter}, FallbackValue=12}"
					>
		</TextBlock>
	</Border>

	</Grid>
 
 
The converter emphasizes the font size relative to the word’s occurrences:
 
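using System;
using System.Globalization;
using System.Windows;
using System.Windows.Data;

public class TrieWordFontSizeConverter : IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, CultureInfo culture)
    {
        const double minFontSize     = 11.0,
                     defaultFontSize = 12.0,
                     maxFontSize     = 32.0;

        if (System.ComponentModel.DesignerProperties.GetIsInDesignMode(new DependencyObject()))
            return defaultFontSize;

        iTrieWord word = value as iTrieWord;

        if (word == null)
            return defaultFontSize;

        // cap the reference maximum at 9: any word with 9 or more occurrences gets the largest font
        double max = Math.Min(9, (double) iWordsCentral.Instance.MaxOccurrences);

        // nothing scanned yet: avoid dividing by zero
        if (max <= 0)
            return defaultFontSize;

        double size = (word.Occurrences / max) * maxFontSize;

        // clamp to the [minFontSize, maxFontSize] range
        return Math.Max(minFontSize, Math.Min(maxFontSize, size));
    }

    // one-way converter: converting a font size back to a word is not supported
    public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture)
    {
        throw new NotSupportedException();
    }
}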

Load, Scan and link words to data items

The View Model objects and processing flow

trie-with-data-view-model

iWordsCentral is the ‘main’ view model (singleton), which provides word scanning and data object assignment through its ScanWordsData (iData object).

ItemList (iDataItemList) is iData’s member responsible for building the Trie (its member) and assigning Trie’s words to its data items.

On Load button click, the MainWindow calls its ReloadData() method.

 
async void ReloadData()
{
	await Task.Run(() => iWordsCentral.Instance.LoadData());
}
 

The method loads the data records (the desired number of records is a parameter… see main figure) and assigns them to the ItemList of the scan object (iData), then calls the iData’s method to build the Trie and assign data items to each of the Trie’s nodes.

 

_scanWordsData.ItemList	= rootList;

bool scanWordsResult = await _scanWordsData.ScanDataWordsAsync(_minWordLength, _includeDocAreaWords, _cancelSource.Token);

 

The iData object calls its ItemList to do the job… its method proceeds as in the following code:

public async Task<bool> ScanDataWordsAsync(int minWordLength, bool scanRootItems, CancellationToken cancelToken)
{
    if (_trie == null)
        _trie = new iTrie();

    // build a single string with all textual items and parse its words
    iTrie   trie          = _trie;
    string  global_string = "";

    foreach (iDataItem item in this)
        global_string += item.StringToParse;

    await Task.Run(() => _trie.LoadFromStringAsync(global_string, minWordLength, notifyChanges: false));

    _trie.Sort();
    List<iTrieWord> trieWordList = trie.AllStrings;

    // copy the Trie words (strings) to a DataTrieWord list
    CopyDataWords(trieWordList);

    // assign words to data items
    bool result = await AssignTrieWordsDataAsync(scanRootItems, cancelToken);
    return result;
}


 

The data item list loops through all its words and data items, calling each data item to assign itself to the given word if that word is contained in its data:

foreach (var word in _dataWordList)
{
   foreach (var ditem in this)
      await ditem.AssignChildrenTrieWordAsyn(scanRootItems, word, cancelToken);
}

The data item looks for any of its data where a match of the given word is found and assigns those items to the word:

var wordItems	= this.Children.Where( i => i.Description != null
				&& i.Description.IndexOfWholeWord( word.Word) >= 0);

IndexOfWholeWord note

That is (an efficient) string extension which is important to ensure that one whole word is present in data. I struggled to find a solution for this question, and finally found an awesome solution proposed by https://stackoverflow.com/users/337327/palota

 

// credit: https://stackoverflow.com/users/337327/palota
public static int IndexOfWholeWord(this string str, string word)
{
    for (int j = 0; j < str.Length && 
        (j = str.IndexOf(word, j, StringComparison.Ordinal)) >= 0; j++)
        if ((j == 0 || !char.IsLetterOrDigit(str, j - 1)) && 
            (j + word.Length == str.Length || !char.IsLetterOrDigit(str, j + word.Length)))
            return j;
    return -1;
}
 

Finally, as you may have noticed, for performance measurement, a simple Stopwatch is embedded into the main view model to report the elapsed time during the process. For this to make sense, all methods are of course async, notifying changes through the UI thread (Dispatcher). You may ignore all the async artifacts in the above code to better concentrate on the processing steps themselves.

Presentation

Once all the processing is done, there is still the presentation UI work to do in order to display the document list of a selected word.

This will be the subject of the next post.

 

doc5ync – the Trie in practice for online e-books

I spent the past few months working on a new web project referencing online e-books (http://doc5.5ync.net/)

The goal of the project was not to build a new online library (many good libraries are already out there) but rather to offer a central reference for everything that exists, adding some features to these references to provide a new analytical view of e-books.

Most online libraries offer access to books that are now in the ‘public domain’ (i.e. no longer protected by copyright) and thus available for free download.

For an analytical approach, I started to use the Trie structure (I talked about this in a previous post) for analyzing textual elements of the referenced e-books to provide relational aspects among them.

Just a reminder, explained in the previous post: a Trie is a tree-like structure where a node has a parent, neighbors and descendants. The structure is particularly interesting for text indexing because, whatever the language, any textual unit (word) is necessarily composed of characters from that language’s alphabet (whose size is quite limited). By adding a flag to end-of-word nodes, we can build a Trie whose root holds the few characters of the alphabet, with branches leading to the words of the text.

trie-word-nodes

This compact structure enables fast and efficient search and retrieval of elements in large text sequences, which seems to be a good base for our e-book text indexing and analysis.
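For readers who did not follow the previous post, here is a minimal sketch of the idea: each node keeps a dictionary of child nodes keyed by character, plus the end-of-word flag. TrieSketch is a hypothetical illustration of mine, not the iTrie class used in the project.

```csharp
using System.Collections.Generic;

public class TrieSketch
{
    private class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsEndOfWord;    // the end-of-word flag
        public int  Occurrences;    // how many times the word was added
    }

    private readonly Node _root = new Node();

    public void Add(string word)
    {
        if (string.IsNullOrEmpty(word))
            return;

        Node node = _root;
        foreach (char c in word)
        {
            // follow the branch for this character, creating it when missing
            if (!node.Children.TryGetValue(c, out Node child))
            {
                child = new Node();
                node.Children.Add(c, child);
            }
            node = child;
        }
        node.IsEndOfWord = true;
        node.Occurrences++;
    }

    // Number of times the whole word was added (0 if it is only a prefix, or absent).
    public int CountOf(string word)
    {
        Node node = _root;
        foreach (char c in word)
            if (!node.Children.TryGetValue(c, out node))
                return 0;
        return node.IsEndOfWord ? node.Occurrences : 0;
    }
}
```

After adding "book", "books" and "book" again, CountOf("book") is 2, CountOf("books") is 1 and CountOf("boo") is 0, since "boo" is only a prefix.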

Using the trie structure to index e-book details (titles, description, author…) of the relatively large number of referenced e-books (approx. 9000 as of writing) was straightforward and efficient.

Now, a given unit (word) in this trie might be related to one or more of our e-books. How to link our trie nodes each to its set of ‘data’? That is the subject of this brief post.

We are going to build upon the elements mentioned in the previous post:

  • We will use our Trie with its (char) Dictionary and Nodes.
  • Our trie provides us with its words presented as a collection of iTrieWord objects
  • Let us create a new object iTrieDataWord (deriving from iTrieWord)
  • This last object will contain a collection of ‘Data items’ (in our concrete case, this will be a collection of e-books)
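In code, the relationship between the two objects could look like the sketch below. The class names and members are hypothetical minimal shapes of mine; the real iTrieWord and iTrieDataWord classes are richer.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for iTrieWord.
public class TrieWordSketch
{
    public string Word        { get; set; }
    public int    Occurrences { get; set; }
}

// Hypothetical stand-in for iTrieDataWord: a word plus the data items it is found in.
public class TrieDataWordSketch<TData> : TrieWordSketch
{
    private readonly HashSet<TData> _dataItems = new HashSet<TData>();

    public IReadOnlyCollection<TData> DataItems => _dataItems;

    // returns false if the item was already linked to this word
    public bool Link(TData item) => _dataItems.Add(item);
}
```

Using a set for the data items means that linking the same e-book twice to a word is harmless.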

trie-with-data

How to proceed?

After some experimentation, I ended up using the following steps, which seemed to be good in terms of efficiency and performance:

  • Load all e-books’ textual sequences (titles, descriptions, author information… for the time being)
  • Build the Trie of these text sequences (more about this later)… which provides us with its Words (iTrieWord) collection
  • Then, in the loaded collection of e-book records (the iDataItem(s), each containing the e-book title, description and author information), each record can assign itself to any of the Trie words whenever that word is part of its own data.

Some additional considerations in the process are quite important:

    • One important point is to define what a ‘word’ is, in terms of the minimum number of characters for a sequence to count as a ‘word’. As the referenced e-books are multilingual, it was fairly clear that this threshold is language-dependent. In Arabic, for instance, words tend to be short in terms of number of characters (Arabic vowels are often part of the character). After some research, I found that 4 chars as a minimum is an acceptable compromise, as it allows searching the e-books by year (author’s or book’s), which may be quite useful.
    • It is also important to define what the ‘word-delimiters’ are (spaces are not the only ones to consider!). Actually, that is also language-dependent in some ways… and as such requires experimentation with all the languages to be used in the given project.
    • Finally: what are we going to do for all this to be useful? I.e. are we going to persist this Trie, or rather proceed as a (runtime queryable) indexing service, etc.? For the doc5 project, we decided to persist the results in data tables and to run the scan process periodically.
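To make the first two considerations concrete, here is a small sketch of a word splitter. It is not the project's parser: WordSplitter is a hypothetical name, and treating every character that is neither a letter nor a digit as a delimiter is a simplification of mine (combining marks, for instance, would need extra handling in some languages).

```csharp
using System.Collections.Generic;

public static class WordSplitter
{
    // Splits on anything that is neither a letter nor a digit, then keeps
    // only the sequences having at least 'minWordLength' characters.
    public static List<string> Split(string text, int minWordLength = 4)
    {
        var words = new List<string>();
        int start = -1;     // start index of the current word, -1 when outside a word

        for (int i = 0; i <= text.Length; i++)
        {
            bool inWord = i < text.Length && char.IsLetterOrDigit(text[i]);

            if (inWord && start < 0)
            {
                start = i;
            }
            else if (!inWord && start >= 0)
            {
                if (i - start >= minWordLength)
                    words.Add(text.Substring(start, i - start));
                start = -1;
            }
        }
        return words;
    }
}
```

With the 4-char minimum, "Les Misérables (1862), by Victor Hugo" gives Misérables, 1862, Victor and Hugo: the year survives, while "Les" and "by" are dropped.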

Some performance numbers

Some numbers to justify using the above steps:

  • Reading data records + Building a Trie of 40365 words (min = 4 chars): 17s
  • Processing the information of 9000 e-books (i.e. building the Trie + creating 358000 links to its words): 8min30s

Will post some sample code in the coming weeks. You may have a look at http://doc5.5ync.net/ (The current version for presenting the results).

A bit late: I wish you all a happy 2020, with many useful projects and much fun!

Jet.oledb.4.0 and utf8 bom story

As you may know, a text file may specify its encoding by a ‘bom’ (byte order mark) using several bytes at its beginning. For utf8 encoding the signature bytes are: 0xef, 0xbb, 0xbf.

I came across an issue while manipulating csv files, for which I decided to use utf8 (thinking that was a good choice for a multi-cultural environment!). The process involved reading and writing back to the same file after some insertions and updates. All using microsoft.jet.oledb.4.0 provider (with a schema.ini specifying CharacterSet=65001 (65001 being utf8 code page)).

My csv files had a header row of which the first column was, ironically, named ‘Match ID’. A few manipulations revealed a somewhat strange behavior: although the debugger showed that the first column’s name was ‘Match ID’, I could no longer access the ‘Match ID’ column by its name. Using the watch window, I asked the debugger:
myColumn.ColumnName == "Match ID"… it replied false… weird!

Viewing the column name's CharArray in hex is more revealing:

 column name issue

That can evidently go on without end! As time goes by, as you manipulate your columns and write back to the csv file, you end up with your 'Match ID' column prefixed by the utf8 bom bytes as many times as you have rewritten the csv file. And if you are a nice guy who lets the users reorder the columns as they need, you may end up with all your columns affected by the issue!

column name bytes

Changing the files’ encoding to unicode (utf-16 little-endian, whose bom signature is 0xff 0xfe) does not reproduce the issue. This makes it clear that the source of the annoyance is not utf8 itself but rather the jet.oledb.4.0 data provider when used with utf8. Still, identifying the source is halfway to solving the issue :).

How to solve this?

Well, you may think of ‘sanitizing’ your column names at every load! Which, in my view, does not seem quite practical.
In my case I just switched to unicode (despite more bytes waste!) to preserve multi-cultural data requirements.
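For those who cannot switch encodings, the ‘sanitizing’ route could look like the sketch below. It is only a guess at what is needed: BomSanitizer is a hypothetical helper, and as I do not know which form the stray bom takes in every case, it strips both the U+FEFF character and the three bom bytes as they appear when decoded with an ANSI code page.

```csharp
using System;

public static class BomSanitizer
{
    const char   BomChar   = '\uFEFF';               // the bom decoded as a unicode character
    const string BomAsAnsi = "\u00EF\u00BB\u00BF";   // bytes EF BB BF decoded as Latin-1 / Windows-1252

    // Removes every leading bom occurrence, in either form, from a column name.
    public static string Strip(string columnName)
    {
        if (string.IsNullOrEmpty(columnName))
            return columnName;

        int start = 0;
        while (start < columnName.Length)
        {
            if (columnName[start] == BomChar)
                start++;
            else if (string.CompareOrdinal(columnName, start, BomAsAnsi, 0, BomAsAnsi.Length) == 0)
                start += BomAsAnsi.Length;
            else
                break;
        }
        return columnName.Substring(start);
    }
}
```

The loop matters here: as the prefix accumulates at each rewrite, a name may carry several boms in a row.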

xsl witness!

Transforming xml content through xsl stylesheets is a useful and relatively common feature in the development process. I talked about this in the previous post about OneNote pages html preview.
Searching in my personal code toolbox, I just found this ‘iXslWitness’, a tool I wrote a couple of years ago to check the effectiveness of a stylesheet in transforming xml to html. Its usage is quite simple: you select an xml file and the xsl stylesheet to use. And you get the html transformed content.
I hope that can be useful for anyone involved in such tasks!
A screenshot of transforming a OneNote page xml content (a list of ‘The World If’ publications of The Economist newspaper):

worldif2017-witness

You can download the tool Here.
The source code is Here.

OneNote page xsl transformation

As we saw in a previous post, OneNote API is xml-based. You call the OneNote Application object to obtain almost all needed information as xml strings.

One such piece of information is the page content, whose schema was explained in the previous post.

Once you get the page content’s xml string, it takes a little bit of work to transform it into a useful html page.

To do this, I used an xslt style sheet which is the subject of the current post.

How does it work?

Assume you have an xsl sheet string and the xml string of a page. You can then process both in a way similar to the following code:

 

using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static string XmlToHtml(string xmlString, string xslString)
{
    byte[] xslBytes = Encoding.UTF8.GetBytes(xslString),
           xmlBytes = Encoding.UTF8.GetBytes(xmlString);

    // 'using' disposes the streams and readers even if the transform throws
    using (var xslStream = new MemoryStream(xslBytes))
    using (var xmlStream = new MemoryStream(xmlBytes))
    using (var xslReader = XmlReader.Create(xslStream))
    using (var xmlReader = XmlReader.Create(xmlStream))
    using (var memStream = new MemoryStream())
    {
        var transform = new XslCompiledTransform();

        // compile the stylesheet, then write the transformed output to memStream
        transform.Load(xslReader);
        transform.Transform(xmlReader, null, memStream);

        // rewind before reading the result back as a string
        memStream.Position = 0;

        using (var sr = new StreamReader(memStream))
            return sr.ReadToEnd();
    }
}

 

 

 

How to get the page xml content?

Assuming you have the page ID, here is some sample code to query the page content through the OneNote API:

 

public static string GetPageContentXmlString(string pageId)
{
    var         onenoteApp  = new Application();
    string      pageXml     = null;
    PageInfo    pageInfo    = PageInfo.piAll;

    onenoteApp.GetPageContent(pageId, out pageXml, pageInfo);
    return pageXml;
}

 

Reminder: The page content xsd schema

 

The xsl style sheet general structure

Let us use the iXml explorer to navigate through the xsl stylesheet used to transform the page’s xml into html.

(I searched for ‘match’ to locate templates defined in the stylesheet).

As you may notice, we have an xsl template to handle each OneNote page defined xsd type:

  • A template to process the page root information
  • A template to process the page’s title
  • A template to process outline elements
  • A template to process OEChildren collection
  • … and so forth
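The ‘one template per element type’ principle can be tried out on a toy example. The mini-schema below (page / title / item) and the XslTemplateDemo class are hypothetical and much simpler than the OneNote schema; the point is only to show how apply-templates dispatches each element to its own template.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XslTemplateDemo
{
    // Hypothetical mini-schema (page / title / item), NOT the OneNote schema.
    const string Xsl = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='html' indent='no'/>
  <xsl:template match='page'><div><xsl:apply-templates/></div></xsl:template>
  <xsl:template match='title'><h1><xsl:value-of select='.'/></h1></xsl:template>
  <xsl:template match='item'><p><xsl:value-of select='.'/></p></xsl:template>
</xsl:stylesheet>";

    public static string Transform(string xml)
    {
        var transform = new XslCompiledTransform();

        // compile the stylesheet
        using (var xslReader = XmlReader.Create(new StringReader(Xsl)))
            transform.Load(xslReader);

        // each element of the input is routed to the template matching its name
        using (var xmlReader = XmlReader.Create(new StringReader(xml)))
        using (var writer = new StringWriter())
        {
            transform.Transform(xmlReader, null, writer);
            return writer.ToString();
        }
    }
}
```

Calling Transform on a page element holding a title and two items produces a div containing an h1 followed by two p elements.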

Sample xsl code

Process OneNote Element (one:OE)

  <!--
  ************************************************
  one:OE
  ************************************************
  -->
  <xsl:template match="one:OE">
    <xsl:param name="nest_level" select="0" />

    <xsl:variable name="listNode" select="./one:List" />
    <xsl:variable name="quickStyleIndex" select="./@quickStyleIndex" />
    <xsl:variable name ="styleNode" select="msxsl:node-set($quickStyleList)/quickStyle[@index=$quickStyleIndex]" />

    <xsl:variable name="quickStyle">
      <xsl:choose>
        <xsl:when test="$styleNode">
          <xsl:value-of select="$styleNode/@style"/>
        </xsl:when>
      </xsl:choose>
    </xsl:variable>

    <!-- is there any list here? -->
    <xsl:choose>
      <xsl:when test="$listNode">
        <xsl:variable name="number"    select="$listNode/one:Number" />
        <xsl:variable name="txt"      select="./one:T" />
        <xsl:variable name="listItemTag">
          <xsl:text>li</xsl:text>
        </xsl:variable>

        <!-- list tag: either <ol> or <ul> -->
        <xsl:variable name="listTag">
          <xsl:choose>
            <xsl:when test="$number">
              <xsl:text>ol</xsl:text>
            </xsl:when>
            <xsl:otherwise>
              <xsl:text>ul</xsl:text>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:variable>

        <xsl:choose>
          <!-- numbered list? output <ol> -->
          <xsl:when test="$number">
            <xsl:variable name="txtNum"        select="number($number/@text)" />
            <xsl:variable name="fontNum"      select="$number/@font" />
            <xsl:variable name="tag" select="concat('&lt;', $listTag, '>')" />
            <xsl:value-of disable-output-escaping="yes" select="$tag"/>

            <!-- *********** output <li value="xx"> ' style="font-family:', $fontNum, ';"',************* -->
            <xsl:value-of disable-output-escaping="yes" select="concat('&lt;', $listItemTag, ' value=', '&quot;', $txtNum, '&quot;>')" />

            <xsl:call-template name="outputListItem">
              <xsl:with-param name="listNode" select="$listNode" />
              <xsl:with-param name="itemText" select="$txt" />
            </xsl:call-template>
          </xsl:when>

          <!-- bullet ? output <ul> -->
          <xsl:otherwise>
            <xsl:variable name="tag" select="concat('&lt;', $listTag, '>')" />
            <xsl:value-of disable-output-escaping="yes" select="$tag"/>
            <!-- *********** output <li> ************* -->
            <xsl:value-of disable-output-escaping="yes" select="concat('&lt;', $listItemTag,'>')" />

            <xsl:call-template name="outputListItem">
              <xsl:with-param name="listNode" select="$listNode" />
              <xsl:with-param name="itemText" select="$txt" />
            </xsl:call-template>
          </xsl:otherwise>

        </xsl:choose>

        <!-- process list's sub items -->
        <xsl:apply-templates select="./one:OEChildren">
          <xsl:with-param name="nest_level" select="1 + $nest_level" />
        </xsl:apply-templates>

        <xsl:apply-templates select="./one:Table" />

        <!-- close the list item tag -->
        <xsl:value-of disable-output-escaping="yes" select="concat('&lt;/', $listItemTag,'>')" />
        <!-- close the list tag -->
        <xsl:value-of disable-output-escaping="yes" select="concat('&lt;/', $listTag, '>')"/>
      </xsl:when>

      <!-- no list: process all sub items -->
      <xsl:otherwise>
        <xsl:variable name="style0" >
          <xsl:call-template name="string-replace-all">
            <xsl:with-param name="text" select="./@style" />
            <xsl:with-param name="replace" select="''" />
            <xsl:with-param name="by" select="''" />
          </xsl:call-template>
        </xsl:variable>

        <xsl:variable name="alignment" select="./@alignment" />

        <xsl:variable name="style">
          <xsl:choose>
            <xsl:when test="$alignment='right'">
              <xsl:value-of select="concat('width:100%; float:right; display:inline; text-align:right;', $style0)"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:value-of select="$style0"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:variable>

        <xsl:choose>
          <xsl:when test="string-length($style)>0">
            <span style="{$style}">
              <xsl:apply-templates />
            </span>
          </xsl:when>

          <!-- ****** no style defined -->
          <xsl:otherwise>
            <xsl:choose>
              <xsl:when test="$quickStyle">
                <span style="{$quickStyle}">
                  <xsl:apply-templates>
                    <xsl:with-param name="nest_level" select="1 + $nest_level" />
                  </xsl:apply-templates>
                </span>
              </xsl:when>

              <xsl:otherwise>
                <span>
                  <xsl:apply-templates>
                    <xsl:with-param name="nest_level" select="1 + $nest_level" />
                  </xsl:apply-templates>
                </span>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:otherwise>
          <!-- ****** end of : no style defined -->
        </xsl:choose>

      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

Output a list (either numbered or bulleted)

<!--
  ************************************************
  output list item (numbered / bulleted)
  ************************************************
  -->
  <xsl:template name="outputListItem" match="one:List">
    <xsl:param name="listNode" />
    <xsl:param name="itemText" />

    <xsl:variable name="number"        select="$listNode/one:Number" />
    <xsl:variable name="bullet"        select="$listNode/one:Bullet" />
    <xsl:variable name="txtNum"        select="number($number/@text)" />

    <xsl:choose>
      <xsl:when test="$number">
        <xsl:value-of select="normalize-space($itemText)" disable-output-escaping="yes"/>
      </xsl:when>

      <xsl:otherwise>
        <xsl:value-of select="normalize-space($itemText)" disable-output-escaping="yes"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

 

Process page images

In the page’s xml, images are returned as base64 strings.

To process an image into html, the xsl template looks like the following:

  <!--
  ************************************************
  one:Image
  ************************************************
  -->
  <xsl:template match="one:Image">
    <xsl:variable name="imgWidth" select="substring-before( number(./one:Size/@width) * 1.33, '.')"/>
    <xsl:variable name="imgHeight" select="substring-before( number(./one:Size/@height) * 1.33, '.')"/>
    <xsl:variable name="imgData"    select="./one:Data" />
    <xsl:variable name="oneFormat"  select="./@format" />
    <xsl:variable name="htmlformat">
      <xsl:choose>
        <xsl:when test="$oneFormat='png'">
          <xsl:text>data:image/png;base64</xsl:text>
        </xsl:when>
        <xsl:otherwise>
          <xsl:text>data:image/jpeg;base64</xsl:text>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <xsl:variable name="htmlWidth">
      <xsl:choose>
        <xsl:when test="$imgWidth">
          <xsl:value-of select="$imgWidth"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:text>96%</xsl:text>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <xsl:variable name="htmlHeight">
      <xsl:choose>
        <xsl:when test="$imgHeight">
          <xsl:value-of select="$imgHeight"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:text>auto</xsl:text>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>


    <img width="{$htmlWidth}" height="{$htmlHeight}" src="{$htmlformat}, {$imgData}"/>
  </xsl:template>

 

Download the xsl stylesheet

You may download the entire xsl stylesheet (for OneNote 2013 and 2016) here

Json object explorer

[json objects =>to property bags =>to objects]

Reducing dependency between clients and services is a major common question in software solutions.

One important area of client/server dependencies lies in the structure of the objects involved in exchanged messages (requests / responses). For instance, a new property added to an object on the service side often crashes the other side (the client) until the new property is introduced into the corresponding client object.

I previously posted about loose coupling through property bags abstractions.

My first approach was based on creating a common convention between service and client, which implies transforming the involved objects into property bags whose values are assigned as needed to business objects on each side at runtime. That still seems to be the 'best solution' from my point of view.

Another approach is to transform the received objects (at either side: server/client) into property bags before assigning their values to the related objects.

This second approach is better suited for situations where creating a common convention would be difficult to put in place.

While working on some projects based on soap-xml messages, I wrote a simple transformer: [xml => property bags => objects]. The transformer then helped write an xml explorer (which actually views xml content as its property bag tree; you can read about this in a previous post).

Another project presented a new challenge in that area, as the service (JEE) was using Json format for its messages. In collaboration with the Java colleagues, we could implement the property bag approach which helped ease client / server versioning issues.

A visual tool, similar to the xml explorer, was needed for developers to explore the structure of json messages. It was time for me to write a new json <==> property bag parser.

The goal was to:

  • Transform json content to property bags
  • Display the transformed property bag tree

Using Newtonsoft's Json library – notably its Linq extensions – was essential.

Here is the overall dependency diagram of the Json explorer app:

 

The main method in the transformation is ParseJsonString, to which you provide the string to be parsed.

Its logic is rather simple:

  • A json string is the representation of either:
    • An object:
      • Read its properties (which may contain arrays… see below)
    • Or an array of:
      • Objects: read the array's objects
      • Arrays: read the array's arrays

 

Code snippets

The parse json string method

public static PropertyBag ParseJsonString(string jsonString)
{
    JObject jObj = null;
    PropertyBag bag = new PropertyBag("Json");
    ObjProperty bagRootNode;
    JArray jArray = null;
    string exceptionString = "";

    /// the json string is either:
    /// * a json object
    /// * a json array
    /// * or an invalid string

    // try to parse the string as a JsonObject
    try
    {
        jObj = JObject.Parse(jsonString);
    }
    catch (Exception ex)
    {
        jObj = null;
        exceptionString = ex.Message;
    }

    // try to parse the string as a JArray
    if(jObj == null)
    {
        try
        {
            jArray = JArray.Parse(jsonString);
        }
        catch (Exception ex2)
        {
            jArray = null;
            exceptionString = ex2.Message;
        }

        if(jArray == null)
        {
            bag.Add(new ObjProperty(_exceptionString, null, false) { ValueAsString = exceptionString });
            return bag;
        }
    }

    bagRootNode = new ObjProperty("JsonRoot", null, false);

    if(bagRootNode.Children == null)
        bagRootNode.Children = new PropertyBag();

    bag.Add(bagRootNode);

    if(jObj != null)
    {
        bagRootNode.SourceDataType = typeof(JObject);
        bagRootNode.Children = ParseJsonObject(bagRootNode, jObj);
    }
    else if(jArray != null)
    {
        bagRootNode.SourceDataType = typeof(JArray);
        bagRootNode.Children = ParseJsonArray(bagRootNode, jArray);
    }

    return bag;
}

 

Parse json array code snippet

 

private static PropertyBag ParseJsonArray(ObjProperty parentItem, JArray jArray)
{
    if(parentItem == null || jArray == null)
        return null;

    if(parentItem.Children == null)
        parentItem.Children = new PropertyBag();

    PropertyBag curBag = parentItem.Children;

    foreach(var item in jArray.Children())
    {
        // an array item is handled here only if it is an object or a nested array
        JObject jo      = item as JObject;
        JArray subArray = item as JArray;
        Type nodeType   = subArray != null ? typeof(JArray) : typeof(JObject);

        ObjProperty childItem = new ObjProperty("item", parentItem, false) { SourceDataType = nodeType };

        if(jo != null)
            ParseJsonObject(childItem, jo);
        else if(subArray != null)
            ParseJsonArray(childItem, subArray);
        else
            continue;    // any other item (a primitive value, for instance) is skipped

        curBag.Add(childItem);
    }

    return curBag;
}
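ParseJsonObject is not listed in this post. As a rough, self-contained sketch of the same recursive walk, the following uses a hypothetical Node class instead of PropertyBag / ObjProperty (Node and JsonWalker are assumptions made for the illustration). It relies on JToken.Parse, which accepts an object or an array in a single call, and it also keeps primitive values.

```csharp
using System.Collections.Generic;
using Newtonsoft.Json.Linq;

// Hypothetical, simplified stand-in for the PropertyBag / ObjProperty classes.
public class Node
{
    public string Name;
    public string Value;                          // set for primitive values only
    public List<Node> Children = new List<Node>();
}

public static class JsonWalker
{
    // JToken.Parse throws a JsonReaderException on invalid input.
    public static Node Parse(string json)
    {
        return Walk("JsonRoot", JToken.Parse(json));
    }

    private static Node Walk(string name, JToken token)
    {
        var node = new Node { Name = name };

        switch (token.Type)
        {
            case JTokenType.Object:
                // an object: read its properties (which may contain arrays)
                foreach (JProperty prop in ((JObject)token).Properties())
                    node.Children.Add(Walk(prop.Name, prop.Value));
                break;

            case JTokenType.Array:
                // an array: read its items (objects, arrays or values)
                foreach (JToken item in (JArray)token)
                    node.Children.Add(Walk("item", item));
                break;

            default:
                // a primitive value: keep its string representation
                node.Value = token.ToString();
                break;
        }

        return node;
    }
}
```

Calling JsonWalker.Parse(jsonString) returns the root node of the tree; a caller that wants the behaviour of the method above (an error entry instead of an exception) can wrap the call in a try / catch.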

 

Json to Xml

Now that we have the json content in property bags, we can almost directly get the xml equivalent (see screenshot below).

The sample json string used for the following screenshot:

 

{
    "web-app": {
    "servlet": [
    {
        "servlet-name": "cofaxCDS",
        "servlet-class": "org.cofax.cds.CDSServlet",
        "init-param": {
            "configGlossary:installationAt": "Philadelphia, PA",
            "configGlossary:adminEmail": "ksm@pobox.com",
            "configGlossary:poweredBy": "Cofax",
            …
            …
            "maxUrlLength": 500
            }
    },
    {
        "servlet-name": "cofaxEmail",
        "servlet-class": "org.cofax.cds.EmailServlet",
        "init-param": {
            "mailHost": "mail1",
            "mailHostOverride": "mail2"
            }
    },
    {
        "servlet-name": "cofaxAdmin",
        "servlet-class": "org.cofax.cds.AdminServlet"
    },

    {
        "servlet-name": "fileServlet",
        "servlet-class": "org.cofax.cds.FileServlet"
    },
    {
        "servlet-name": "cofaxTools",
        "servlet-class": "org.cofax.cms.CofaxToolsServlet",
        "init-param": {
            "templatePath": "toolstemplates/",
            "log": 1,
            …
            …
            "adminGroupID": 4,
            "betaServer": true
            }
        }
    ],
    "servlet-mapping": {
        "cofaxCDS": "/",
        "cofaxEmail": "/cofaxutil/aemail/*",
        "cofaxAdmin": "/admin/*",
        "fileServlet": "/static/*",
        "cofaxTools": "/tools/*"
        },

    "taglib": {
        "taglib-uri": "cofax.tld",
        "taglib-location": "/WEB-INF/tlds/cofax.tld"
        }
    }
}
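As a rough idea of how such a conversion can go, here is a minimal sketch that walks the json tokens directly and builds an XElement tree. It is an assumption-laden illustration, not the explorer's code (which goes through the property bags): JsonToXml, ToXml, the "JsonRoot" root name and the "item" name for array entries are all illustration names.

```csharp
using System.Xml;
using System.Xml.Linq;
using Newtonsoft.Json.Linq;

public static class JsonToXml
{
    public static XElement ToXml(string json, string rootName = "JsonRoot")
    {
        return ToElement(rootName, JToken.Parse(json));
    }

    private static XElement ToElement(string name, JToken token)
    {
        // Json names such as "configGlossary:adminEmail" are not valid xml
        // element names; EncodeLocalName escapes the offending characters.
        var element = new XElement(XmlConvert.EncodeLocalName(name));

        if (token is JObject)
        {
            foreach (JProperty prop in ((JObject)token).Properties())
                element.Add(ToElement(prop.Name, prop.Value));
        }
        else if (token is JArray)
        {
            // array entries have no name in json: use a conventional one
            foreach (JToken item in (JArray)token)
                element.Add(ToElement("item", item));
        }
        else
        {
            element.Value = token.ToString();
        }

        return element;
    }
}
```

JsonToXml.ToXml(jsonString).ToString() then gives the xml text. Note that the original json names can be recovered from the escaped element names with XmlConvert.DecodeName.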



Screenshot

 

You may download the binaries here!

The source code is available here!

OneNote Explorer

In January 2004, Chris Pratley wrote about "OneNote genesis". His article ended with this sentence: "…I can see how this might become addictive."

Yes, "addictive", a word often used by people who know OneNote, is the right adjective for it.

What makes it addictive is probably the fact that it is a medium for 'not-yet-documented' ideas. At the same time, it offers a good and simple hierarchical storage that helps organize those ideas for future documentation.

One great thing is that OneNote exposes its objects and methods for developers through an API for extending its features. There are many feature-rich add-ins for OneNote, which can fit your needs in several areas.

In my case, I needed a sort of 'periscope' to explore my own notes. A tool that can let me see my notes ordered by creation or last update dates, mark and retrieve some of them as favorite items, have a quick preview of a note, locate the section's file folder, search notes' titles and/or content… etc.

Using the API, I could write a 'OneNote Explorer'… a tool I started writing in 2010, enriching it with new features from time to time.

OneNote API

The OneNote API is xml-based. A set of methods in the Application interface lets you get information about open notebooks and their structure (section groups, sections, pages… etc.) in xml format.
Understanding the OneNote xsd schema is thus essential.

OneNote XSD overview: main objects

 onenote xsd

  • A Notebook (similar to a file folder) is a sequence of:
    • Sections
    • And/or Section groups. Where a Section group itself is a sequence of:
      • Section groups
      • And/or Sections. Where a Section is a sequence of:
        • Pages.
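The nesting described above can be explored with Linq to Xml. The sketch below is only an illustration: in the real application the hierarchy xml comes from the OneNote Application interface, whereas here a small hand-written fragment (an assumption, not real OneNote output) stands in for it so that the code runs without OneNote. HierarchyReader is an illustration name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class HierarchyReader
{
    static readonly XNamespace One = "http://schemas.microsoft.com/office/onenote/2013/onenote";

    // Hand-written sample mimicking the structure (not real OneNote output).
    public const string SampleXml = @"<one:Notebooks xmlns:one='http://schemas.microsoft.com/office/onenote/2013/onenote'>
  <one:Notebook name='Work'>
    <one:Section name='Inbox'>
      <one:Page name='Older note' lastModifiedTime='2018-04-01T10:00:00.000Z' />
    </one:Section>
    <one:SectionGroup name='Projects'>
      <one:Section name='doc5ync'>
        <one:Page name='Newer note' lastModifiedTime='2018-04-20T19:00:30.000Z' />
      </one:Section>
    </one:SectionGroup>
  </one:Notebook>
</one:Notebooks>";

    // Page names, most recently modified first, whatever their nesting depth
    // (directly in a section, or in a section inside section groups).
    public static List<string> PagesByLastUpdate(string hierarchyXml)
    {
        return XDocument.Parse(hierarchyXml)
            .Descendants(One + "Page")
            .OrderByDescending(p => (DateTime)p.Attribute("lastModifiedTime"))
            .Select(p => (string)p.Attribute("name"))
            .ToList();
    }
}
```

Descendants() flattens the notebook / section group / section nesting, which is what a "pages ordered by last update" view needs; Elements() would be the choice for rebuilding the tree level by level.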

OneNote Page definition

onenote page xsd

  • Apart from its attributes (see xsd elements above), a Page is a set of elements, each being one of:
    • An Image
    • A Drawing
    • A File
    • A Media
    • An Outline (similar to the html main <div>) which (somewhat simplified here) is a sequence of:
      • OE children (OneNote Elements). Each OE can be either:
        • An Image
        • A Table
        • A Drawing
        • Or a sequence of:
          • T (text range)

Application overview

Application view model classes:

 application main classes

 

A static class (OneNoteHelpers) exposes several methods to communicate with the OneNote API and create / update the view model objects as required:

 code map 1

Summary of methods exposed by the helper static class:

 onenote helpers

Features

  • View selected section pages. Search titles, sort the datagrid, add page to favorites, open page in OneNote, html preview

feature 1

  • View all notebook pages. Search titles, sort the datagrid, add page to favorites, open page in OneNote, html preview

feature 2

  • Search selected notebooks

feature 3

  • Manage favorites: delete, preview, open in OneNote…

feature 4

Page preview note

The application's Preview button displays the html content of the selected page. As the OneNote API can return the page xml content, an xslt style sheet (with one template per element type of the xsd definition) allows a simple and quick preview.

sample page 

The sample page's xml tree

sample page's xsd

The page xml code

 
<?xml version="1.0" encoding="utf-8"?>
<one:Page xmlns:one="http://schemas.microsoft.com/office/onenote/2013/onenote"
         ID="{138E40BA-13BB-4D24-A78C-D92E4E23D574}{1}{E1949424590587215702781963951539196781170461}"
         name="Sample page title" dateTime="2018-04-20T18:53:38.000Z"
         lastModifiedTime="2018-04-20T19:00:30.000Z"
         pageLevel="1"
         isCurrentlyViewed="true"
         selected="partial"
         lang="en-US">
<!-- ***************** quick styles ***************************** -->
<one:QuickStyleDef index="0"
                     name="PageTitle"
                     fontColor="automatic"
                     highlightColor="automatic"
                     font="Calibri Light"
                     fontSize="20.0" spaceBefore="0.0" spaceAfter="0.0" />
 
<one:QuickStyleDef index="1"
                     name="p"
                     fontColor="automatic"
                     highlightColor="automatic"
                     font="Calibri"
                     fontSize="12.0"
                     spaceBefore="0.0" spaceAfter="0.0" />
    
<!-- ********** page settings ********** -->
<one:PageSettings RTL="false" color="automatic">
<one:PageSize>
<one:Automatic />
</one:PageSize>
<one:RuleLines visible="false" />
</one:PageSettings>
    
<!-- ********** title ********** -->
<one:Title selected="partial" lang="en-US">
<one:OE author="taoffi" authorInitials="T.N."
            lastModifiedBy="taoffi"
            lastModifiedByInitials="T.N."
            creationTime="2018-04-20T18:53:46.000Z"
            lastModifiedTime="2018-04-20T18:53:46.000Z"
            objectID="{9DB0692F-3758-49BC-8E87-F040F40599C2}{15}{B0}"
            alignment="left"
            quickStyleIndex="0" selected="partial">
<one:T><![CDATA[Sample page title]]></one:T>
<one:T selected="all"><![CDATA[]]></one:T>
</one:OE>
</one:Title>
    
<!-- ********** page content (main <div>) ********** -->
<one:Outline author="taoffi"
             authorInitials="T.N."
             lastModifiedBy="taoffi"
             lastModifiedByInitials="T.N."
             lastModifiedTime="2018-04-20T19:00:28.000Z"
             objectID="{9DB0692F-3758-49BC-8E87-F040F40599C2}{30}{B0}">
<one:Position x="36.0" y="86.4000015258789" z="0" />
<one:Size width="123.9317169189453" height="14.64842319488525" />
<one:OEChildren>
        <!-- ********** paragraph ********** -->
<one:OE creationTime="2018-04-20T18:53:47.000Z"
             lastModifiedTime="2018-04-20T18:53:52.000Z"
             objectID="{9DB0692F-3758-49BC-8E87-F040F40599C2}{33}{B0}"
             alignment="left"
             quickStyleIndex="1">
         <!-- ********** paragraph text ********** -->
<one:T><![CDATA[Sample page text]]></one:T>
</one:OE>
</one:OEChildren>
</one:Outline>
</one:Page>
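To illustrate the xslt preview idea on such a page, here is a minimal sketch. The stylesheet below is deliberately tiny and written for this illustration (it is not the downloadable one): the page name becomes an <h1> and each text range of the outline becomes a <p>. PagePreview and ToHtml are illustration names.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class PagePreview
{
    // Illustration-only stylesheet: one template for the page, one for T.
    const string Xslt = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:one='http://schemas.microsoft.com/office/onenote/2013/onenote'
    exclude-result-prefixes='one'>
  <xsl:output method='html'/>
  <xsl:template match='/one:Page'>
    <html><body>
      <h1><xsl:value-of select='@name'/></h1>
      <xsl:apply-templates select='one:Outline//one:T'/>
    </body></html>
  </xsl:template>
  <xsl:template match='one:T'>
    <p><xsl:value-of select='.'/></p>
  </xsl:template>
</xsl:stylesheet>";

    // Transforms the page xml (as returned by the OneNote API) into html text.
    public static string ToHtml(string pageXml)
    {
        var transform = new XslCompiledTransform();

        using (var styleReader = XmlReader.Create(new StringReader(Xslt)))
            transform.Load(styleReader);

        using (var pageReader = XmlReader.Create(new StringReader(pageXml)))
        using (var output = new StringWriter())
        {
            transform.Transform(pageReader, null, output);
            return output.ToString();
        }
    }
}
```

A complete stylesheet follows the same pattern, with one template per xsd element type (images, tables, quick styles… etc.).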

 

OneNote 2010 vs. 2013 and above

There are some compatibility issues between the OneNote API of the 2010 version and that of 2013 and above. More changes have been introduced in the 365 version.

The downloadable binaries here are for OneNote 2013 and 2016 desktop.

You may download the binaries Here!

Xamarin: the missing Description attribute

If you are a Windows C# programmer, you may know about the DescriptionAttribute and how it helps describe an object, a property, a method or any other member of a class in a human-readable way.

You may also use [Description] to associate a label with a property and later use it in the UI.

As Xamarin.Forms does not (yet) offer this simple useful attribute, I decided to write one.

The basic code

Our Description attribute class is quite simple:

[AttributeUsage(validOn: AttributeTargets.All, AllowMultiple = false, Inherited = false)]
public class DescriptionAttribute : Attribute
{
    // backing field (kept private: access goes through the property)
    private string _description;

    public DescriptionAttribute(string description)
    {
         _description = description;
    }

    public string Description
    {
        get { return _description; }
        set { _description = value; }
    }
}

 

With this in place, we can now write things like:

[Description("This is my object")]
public class MyClass
{
    [Description("This is my constructor")]
    public MyClass()
    {
    }

    [Description("My property")]
    public string Property1
    {
     … 
    }
}

 

 

Reading back the descriptions

For this to be useful, though, we need a way to read these descriptions back when needed.

First: how to read the descriptions assigned to objects and members?

Some extension helpers can simplify this work (remember: we are in a portable (PCL) library, and we also need System.Reflection):

public static string Description(this Type objectType)
{
    if(objectType == null)
        return null;

    TypeInfo typeInfo = objectType.GetTypeInfo();
    var attrib = typeInfo.GetCustomAttribute<DescriptionAttribute>();
    return attrib == null ? null : attrib.Description;
}



To get the description of MyClass, the above extension method now allows us to write:

string    classDescription    = typeof(MyClass).Description();

 

Another method can make that even simpler: get the description of the class of a given object instance:

public static string Description(this object obj)
{
    if(obj == null)
        return null;

    return obj.GetType().Description();
}


 

This allows us to write: myObject.Description(); to get the description of myObject's class.

Reading back the description of class members

To read the description of a given member (property / method…), we have to look into the MemberInfo object of that member to find the Description attribute:

 

public static string MemberDescription(this MemberInfo member)
{
    if(member == null)
        return null;

    var attrib = member.GetCustomAttribute<DescriptionAttribute>();
    return attrib == null ? null : attrib.Description;
}

 

 

How to get a member info? Not quite handy!...

Let us simplify a little more. The following code (though it may be somewhat difficult to read… see System.Linq.Expressions) will allow a much easier syntax.

 

public static string PropertyDescription<TMember>( Expression<Func<TMember>> memberExpression)
{
    if (memberExpression == null)
        return null;

    var expression = memberExpression.Body as MemberExpression;
    var member = expression == null ? null : expression.Member;

    if (member == null)
        return null;

    return member.MemberDescription();
}

 

 

The above method analyses its Expression parameter to extract the MemberInfo for us and calls the original method to extract the description of the member.

To use this method to extract the description of 'Property1', we can write:

string propertyLabel = PropertyDescription(()=> Property1);

 

What about Enums?

As you know, we often give short (occasionally cryptic!) names for enum members. Displaying these names in a clear human-readable manner to the user for selecting a value is usually a challenge.

The Description attribute can help us assign these labels. Reading them back is somewhat of another challenge. The reason is that enum members are fields (in contrast to the properties and methods we saw before).

Actually, with enums, we have two challenges: find the description of each element, but also be able to retrieve the value of a selected description.

For the first challenge, find the description of an element, let us try this extension:

 

public static string EnumDescription(this Enum value)
{
    if(value == null)
        return null;

    var type = value.GetType();
    TypeInfo typeInfo = type.GetTypeInfo();
    string typeName = Enum.GetName(type, value);

    if (string.IsNullOrEmpty(typeName))
        return null;

    var field = typeInfo.DeclaredFields.FirstOrDefault(f => f.Name == typeName);

    if(field == null)
        return typeName;

    var attrib = field.GetCustomAttribute<DescriptionAttribute>();
    return attrib == null ? typeName : attrib.Description;
}

 

 

Sample usage:

 

public enum Flowers
{
    [Description("African lily")]
    Agapanthus,

    [Description("Alpine thistle")]
    Eryngium,

    [Description("Amazon lily")]
    Eucharis,
};

 

string africanLilly = Flowers.Agapanthus.EnumDescription(); //"African lily"
string alpineThistle = Flowers.Eryngium.EnumDescription(); //"Alpine thistle"

 

Find the enum value by its description

The second task is to retrieve the enum value by its description. For instance when the user selects one of the displayed descriptions.

 

public static int EnumValue(this Enum value, string selectedOption)
{
    var values = Enum.GetValues(value.GetType());

    foreach(Enum v in values)
    {
        string description = v.EnumDescription();

        // a System.Enum cannot be cast directly to int: convert it instead
        if(description == selectedOption)
            return Convert.ToInt32(v);
    }

    return 0; // arbitrary default value
}

 

 

Now we can write:

 

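// EnumValue extends an enum instance and returns an int: hence default(Flowers) and the cast
Flowers africanLily = (Flowers) default(Flowers).EnumValue("African lily");    // Agapanthus
Flowers amazonLily = (Flowers) default(Flowers).EnumValue("Amazon lily");    // Eucharis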
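As a possible variation, a generic helper can return the enum type itself and tell a "not found" apart from the member whose value is 0. This is only a sketch under assumptions: EnumDescriptions, Label, Labels and TryFromLabel are illustration names, and the attribute is redeclared in a reduced form so that the block stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Reduced copy of the attribute, so that this sketch is self-contained.
[AttributeUsage(AttributeTargets.All)]
public class DescriptionAttribute : Attribute
{
    public DescriptionAttribute(string description) { Description = description; }
    public string Description { get; private set; }
}

public static class EnumDescriptions
{
    // Label of one enum member: its [Description] if present, else its name.
    public static string Label<TEnum>(TEnum value) where TEnum : struct
    {
        string name = value.ToString();
        FieldInfo field = typeof(TEnum).GetTypeInfo().GetDeclaredField(name);
        var attrib = field == null ? null : field.GetCustomAttribute<DescriptionAttribute>();
        return attrib == null ? name : attrib.Description;
    }

    // All labels (in the order of Enum.GetValues), e.g. to fill a picker.
    public static List<string> Labels<TEnum>() where TEnum : struct
    {
        return Enum.GetValues(typeof(TEnum)).Cast<TEnum>().Select(v => Label(v)).ToList();
    }

    // Reverse lookup: false when no member carries the given label.
    public static bool TryFromLabel<TEnum>(string label, out TEnum result) where TEnum : struct
    {
        foreach (TEnum v in Enum.GetValues(typeof(TEnum)))
        {
            if (Label(v) == label)
            {
                result = v;
                return true;
            }
        }

        result = default(TEnum);
        return false;
    }
}

public enum Flowers
{
    [Description("African lily")]   Agapanthus,
    [Description("Alpine thistle")] Eryngium,
    [Description("Amazon lily")]    Eucharis,
}
```

Usage would look like: Flowers flower; bool found = EnumDescriptions.TryFromLabel("Amazon lily", out flower); while EnumDescriptions.Labels<Flowers>() gives the list of labels to display.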