I spent the past few months working on a new web project that references online e-books (http://doc5.5ync.net/).
The goal of the project was not to build a new online library (many good libraries are already out there) but rather to offer a central reference for what already exists, adding some features to these references to provide a new analytical view of e-books.
Most online libraries offer access to books that are now in the ‘public domain’ (i.e. no longer protected by copyright) and thus available for free download.
For the analytical approach, I started using the Trie structure (I talked about this in a previous post) to analyze the textual elements of the referenced e-books and expose the relationships among them.
Just a reminder of what was explained in the previous post: a Trie is a tree-like structure where a node has a parent, neighbors and descendants. The structure is particularly interesting for text indexing because, whatever the language, any textual unit (word) is necessarily composed of characters from that language’s alphabet (whose size is quite limited). By adding a flag to end-of-word nodes, we can build a Trie whose root branches only into the few characters of the alphabet, with each path down the tree spelling out a word of the text.
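As a rough illustration of that reminder, here is a minimal Python sketch of a trie with an end-of-word flag. The names (TrieNode, Trie, add, contains, words) are made up for this example and are not the classes from the previous post.

```python
class TrieNode:
    """One character position in the trie (hypothetical name)."""

    def __init__(self):
        self.children = {}        # next character -> TrieNode
        self.is_word_end = False  # the end-of-word flag


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Insert a word, reusing the nodes of any shared prefix."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word_end = True

    def _find(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """True only if word was added as a whole word, not just as a prefix."""
        node = self._find(word)
        return node is not None and node.is_word_end

    def words(self, prefix=""):
        """Yield every stored word that starts with prefix (order not guaranteed)."""
        start = self._find(prefix)
        if start is None:
            return
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word_end:
                yield text
            for ch, child in node.children.items():
                stack.append((child, text + ch))
```

Adding "book", "books" and "boat" creates a single child under the root (the letter "b"), and the three words share the "bo" prefix nodes; this sharing is what keeps the structure compact.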
This compact structure enables fast and efficient search and retrieval of elements in large text sequences, which makes it a good base for our e-book text indexing and analysis.
Using the trie structure to index e-book details (titles, description, author…) of the relatively large number of referenced e-books (approx. 9000 as of writing) was straightforward and efficient.
Now, a given unit (word) in this trie might be related to one or more of our e-books. How do we link each trie node to its set of ‘data’? That is the subject of this brief post.
We are going to build upon the elements mentioned in the previous post:
- We will use our Trie with its (char) Dictionary and Nodes.
- Our trie provides us with its words, presented as a collection of iTrieWord objects.
- Let us create a new object iTrieDataWord (deriving from iTrieWord)
- This last object will contain a collection of ‘Data items’ (in our concrete case, this will be a collection of e-books)
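The object relationship in this list can be sketched in a few lines of Python. This is only an illustration under assumptions: TrieWord and TrieDataWord play the roles of iTrieWord and iTrieDataWord, and DataItem with its item_id and title fields is a hypothetical stand-in for an e-book record.

```python
class DataItem:
    """Stand-in for an e-book record (hypothetical fields)."""

    def __init__(self, item_id, title):
        self.item_id = item_id
        self.title = title


class TrieWord:
    """A word exposed by the trie (plays the role of iTrieWord)."""

    def __init__(self, text):
        self.text = text


class TrieDataWord(TrieWord):
    """A trie word plus the data items that contain it (role of iTrieDataWord)."""

    def __init__(self, text):
        super().__init__(text)
        self.items = {}  # item_id -> DataItem; keyed by id to avoid duplicate links

    def attach(self, item):
        """Link a data item to this word; attaching the same id twice is a no-op."""
        self.items[item.item_id] = item
```

Keying the collection by item id is a design choice of this sketch: a record that mentions the same word in both its title and its description still produces a single link.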
How to proceed?
After some experimentation, I ended up using the following steps, which seemed good in terms of efficiency and performance:
- Load all e-books’ textual sequences (titles, descriptions, author information… for the time being)
- Build the Trie of these text sequences (more about this later)… which provides us with its Words (iTrieWord) collection
- Now, go through the loaded collection of e-book records (the iDataItem objects; each record contains the e-book title, description and author information): each record (iDataItem) can assign itself to any of the Trie words whenever that word is part of its own data.
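The three steps above can be sketched as follows. To keep the example short and self-contained, a plain dictionary stands in for the Trie's word collection, and words are extracted with a simple regular-expression split; both are assumptions of this sketch, as are the function names and the record layout (a dict of text fields per record id).

```python
import re

MIN_WORD_LENGTH = 4  # the threshold discussed in this post


def record_words(fields):
    """Return the distinct words (>= MIN_WORD_LENGTH chars) found in one record."""
    text = " ".join(fields.values()).lower()
    return {w for w in re.split(r"\W+", text) if len(w) >= MIN_WORD_LENGTH}


def link_records_to_words(records):
    """Map each word to the set of record ids whose data contains it."""
    # Steps 1 and 2: read the text of all records and build the word collection
    # (a dict here; the real project builds a Trie at this point).
    word_links = {}
    for fields in records.values():
        for word in record_words(fields):
            word_links.setdefault(word, set())
    # Step 3: each record assigns itself to every word found in its own data.
    for record_id, fields in records.items():
        for word in record_words(fields):
            word_links[word].add(record_id)
    return word_links
```

With two made-up records such as {1: {"title": "History of Rome"}, 2: {"title": "Rome in 1850"}}, the word "rome" ends up linked to both records, while "history" and "1850" are each linked to one.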
Some additional considerations in the process are quite important:
- One important point is to define what a ‘word’ is, in terms of the minimum number of characters required to consider a sequence a ‘word’. As the referenced e-books are multilingual, it was fairly clear that this threshold is language-dependent. In Arabic, for instance, words tend to be short in terms of number of characters (Arabic vowels are often part of the character). After some research, I found that a minimum of 4 chars is an acceptable compromise, as it allows searching the e-books by year (author’s or book’s), which may be quite useful.
- It is also important to define what the ‘word-delimiters’ are (spaces are not the only ones to consider!). Actually, that is also language-dependent in some ways… and as such requires experimentation with all the languages to be used in the given project.
- Finally: what are we going to do for all this to be useful? I.e. are we going to persist this Trie? Or rather proceed as a (runtime queryable) indexing service?… etc. For the doc5 project, we decided to persist the results in data tables and run the scan process periodically.
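To make the first two considerations concrete, here is a small Python sketch of a word splitter where both the delimiter set and the minimum length are parameters. The delimiter list is only an assumed starting point (it includes a few Arabic punctuation marks as an example) and not the one used in doc5; as noted above, it would need tuning per language.

```python
# Assumed starting set: whitespace, common Latin punctuation, and the
# Arabic comma, semicolon and question mark. Not the project's actual list.
DEFAULT_DELIMITERS = " \t\r\n.,;:!?()[]{}\"'«»-/،؛؟"


def split_words(text, delimiters=DEFAULT_DELIMITERS, min_length=4):
    """Split text on any delimiter and keep sequences of at least min_length chars."""
    words = []
    current = []
    for ch in text:
        if ch in delimiters:
            # A delimiter closes the current sequence; keep it if long enough.
            if len(current) >= min_length:
                words.append("".join(current))
            current = []
        else:
            current.append(ch)
    # Do not forget the last sequence when the text does not end on a delimiter.
    if len(current) >= min_length:
        words.append("".join(current))
    return words
```

For instance, split_words("Les Misérables (1862), by Victor Hugo") keeps "Misérables", "1862", "Victor" and "Hugo": the year survives because it is exactly 4 characters long, while "Les" and "by" fall below the threshold. Lowering min_length to 3 would also keep "Les", which shows how sensitive the index size is to this setting.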
Some performance numbers
Some numbers to justify using the above steps:
- Reading data records + Building a Trie of 40365 words (min = 4 chars): 17s
- Processing the information of 9000 e-books (i.e. building the Trie + creating 358000 links to its words): 8min30s
I will post some sample code in the coming weeks. In the meantime, you may have a look at http://doc5.5ync.net/ (the current version, which presents the results).
A bit late, but I wish you all a happy 2020, with many useful projects and much fun!