I continue here the excursion around using the Trie pattern and structures to index e-book words for the doc5ync project.
If you missed the beginning of the story, you can find it Here, Here and Here…
The role of the client integration tool (a WPF app) is to pull e-books information to be indexed from the database, proceed to indexing the words and creating the links between each word and its related e-book. This is done using some settings: the language to index, the minimum number of chars to consider a sequence as a ‘word’… etc.
The integration process flow is quite simple:
- Once we are happy with the obtained results, we use the tool to push the trie to the database in a staging table.
- A database stored procedure can then extract the staging data into the tables used for presenting the index on the project web page.
The staging table has a few fields:
- The word string
- The related e-book ID (relationship => docs table (e-books))
- The number of occurrences of the word
- The timestamp of the last insertion
The only difficulty encountered was the number of records (often tens of thousands) to push to the staging table. The (artisanal!) solution was to concatenated values of blocks of records to be inserted (I.e.: ‘insert into table(field1, field2, …) values ( v1, v2, …), (v3, v4, …), …’ etc.). Sending 150 records per command seemed to be a sustainable choice.
The staging table data is to be dispatched into two production tables:
- word ID
- language ID
- word string
- word’s number of occurrences
- word ID (relationship => the above table)
- e-book ID (relationship => docs (e-books) table)
Once the data is in the staging table, the work of the stored procedure is quite straightforward:
- Delete the current words table (which cascade deletes the words / docs reference records)
- Import the staging word (strings and occurrences) records into doc5_trie_words
- Import the related word / doc IDs into doc5_trie_word_docs.
Many words are common between languages and e-books. Therefore assigning a language to a word has no sense unless all its related documents are from one specific language. That is the additional and final task of the stored proc.
Next step: the index web page presentation!
That will be the subject of the next post!