I continue here the excursion around using the Trie pattern and structures to index e-book words for the doc5ync project.
If you missed the beginning of the story, you can find it here, here and here…
The role of the client integration tool (a WPF app) is to pull the information of the e-books to be indexed from the database, index their words, and create the links between each word and its related e-books. This is done according to a few settings: the language to index, the minimum number of characters for a sequence to be considered a ‘word’, etc.
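To make the indexing step more concrete, here is a minimal Python sketch (the actual tool is a WPF app, so this is only an illustration, not the project's code). The names `TrieNode`, `Trie`, `index_text` and `min_chars` are hypothetical, and the language setting is left out for brevity.

```python
import re
from collections import defaultdict


class TrieNode:
    def __init__(self):
        self.children = {}            # next character -> TrieNode
        self.docs = defaultdict(int)  # e-book ID -> occurrences of the word ending here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, doc_id):
        """Walk (or create) the path for `word` and count one occurrence for `doc_id`."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.docs[doc_id] += 1

    def items(self):
        """Yield (word, doc_id, occurrences) for every indexed word."""
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            for doc_id, count in node.docs.items():
                yield prefix, doc_id, count
            for ch, child in node.children.items():
                stack.append((prefix + ch, child))


def index_text(trie, doc_id, text, min_chars=3):
    """Index every run of letters that has at least `min_chars` characters."""
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if len(word) >= min_chars:
            trie.add(word, doc_id)
```

Each tuple produced by `items()` carries a word, an e-book ID and an occurrence count, which is all that is needed to link a word to its e-books.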
The integration process flow is quite simple:
- Once we are happy with the obtained results, we use the tool to push the trie to the database in a staging table.
- A database stored procedure can then extract the staging data into the tables used for presenting the index on the project web page.
The staging table has a few fields:
- The word string
- The related e-book ID (relationship => docs table (e-books))
- The number of occurrences of the word
- The timestamp of the last insertion
The only difficulty encountered was the number of records (often tens of thousands) to push to the staging table. The (artisanal!) solution was to concatenate the values of blocks of records into a single insert command (i.e. ‘insert into table(field1, field2, …) values (v1, v2, …), (v3, v4, …), …’). Sending 150 records per command seemed to be a sustainable choice.
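The block-by-block push can be sketched as follows in Python. This is an illustration under assumptions, not the tool's code: the helper names are hypothetical, SQLite stands in for the real database, and the values are passed as ‘?’ parameters instead of being concatenated into the command text (a deliberate swap, to avoid quoting problems).

```python
import sqlite3


def chunked(rows, size=150):
    """Yield successive blocks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_insert(table, columns, rows):
    """Return (sql, params) for one multi-row insert command."""
    row_marks = "(" + ", ".join("?" for _ in columns) + ")"
    sql = "insert into {}({}) values {}".format(
        table, ", ".join(columns), ", ".join(row_marks for _ in rows))
    params = [value for row in rows for value in row]
    return sql, params


def push_rows(conn, table, columns, rows, size=150):
    """Send the rows block by block; return the number of commands sent."""
    commands = 0
    for block in chunked(rows, size):
        sql, params = build_insert(table, columns, block)
        conn.execute(sql, params)
        commands += 1
    conn.commit()
    return commands
```

With 150 records per command, pushing 30,000 records takes 200 round trips instead of 30,000.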
The staging table data is to be dispatched into two production tables.
The words table (doc5_trie_words):
- word ID
- language ID
- word string
- word’s number of occurrences
The words / docs reference table (doc5_trie_word_docs):
- word ID (relationship => the above table)
- e-book ID (relationship => docs (e-books) table)
Once the data is in the staging table, the work of the stored procedure is quite straightforward:
- Delete the current records of the words table (which cascade-deletes the words / docs reference records)
- Import the staging word (strings and occurrences) records into doc5_trie_words
- Import the related word / doc IDs into doc5_trie_word_docs.
Many words are common to several languages and e-books. Therefore, assigning a language to a word makes no sense unless all its related documents are in one specific language. Making that assignment is the additional and final task of the stored proc.
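The whole dispatch, including the language rule, might look like the following sketch. It runs against an in-memory SQLite database from Python, not the project's real database or stored procedure. Only the names doc5_trie_words and doc5_trie_word_docs come from the project; the `staging` and `docs` table layouts, the column names, and the choice to sum occurrences per word are assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: only the two doc5_trie_* table names come from the project.
SCHEMA = """
create table docs(id integer primary key, language_id integer);
create table staging(word text, doc_id integer, occurrences integer);
create table doc5_trie_words(
    id integer primary key, language_id integer,
    word text unique, occurrences integer);
create table doc5_trie_word_docs(
    word_id integer references doc5_trie_words(id) on delete cascade,
    doc_id integer references docs(id));
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("pragma foreign_keys = on")  # SQLite needs this for cascade deletes
    conn.executescript(SCHEMA)
    return conn


def dispatch(conn):
    """Rebuild the two production tables from the staging table."""
    with conn:  # one transaction for the whole dispatch
        # 1. Empty the words table; the word / doc references are cascade-deleted.
        conn.execute("delete from doc5_trie_words")
        # 2. Import the words with their total occurrences (assumed to be a sum).
        conn.execute("""
            insert into doc5_trie_words(word, occurrences)
            select word, sum(occurrences) from staging group by word""")
        # 3. Import the word / doc links.
        conn.execute("""
            insert into doc5_trie_word_docs(word_id, doc_id)
            select distinct w.id, s.doc_id
            from staging s join doc5_trie_words w on w.word = s.word""")
        # 4. Assign a language only when all related docs share a single one.
        conn.execute("""
            update doc5_trie_words set language_id = (
                select min(d.language_id)
                from doc5_trie_word_docs wd join docs d on d.id = wd.doc_id
                where wd.word_id = doc5_trie_words.id)
            where (
                select count(distinct d.language_id)
                from doc5_trie_word_docs wd join docs d on d.id = wd.doc_id
                where wd.word_id = doc5_trie_words.id) = 1""")
```

A word found only in e-books of one language gets that language ID; a word shared across languages keeps a null language ID.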
Next step: the index web page presentation!
That will be the subject of the next post!
Browser add-ons and plug-ins are nice features that can sometimes become annoying. Flash Player, for instance, is nice but extensively used in boring commercial ads.
Fortunately you can either disable the annoying add-on or allow it to run only on specific sites.
To do this in Internet Explorer:
- First click Internet options / Programs / Manage add-ons.
- Select the add-on and click More Information.
- By default, the add-on is allowed to run on all sites. This is displayed as * (asterisk). Click Remove all sites.
- Click OK to confirm your choice and close the Internet options dialog. You are done.
- Now, the add-on will no longer be executed on any web site.
- In fact, each time a web site needs to run the add-on, IE will ask you whether you want to allow it to run on this specific site.
- The list of allowed web sites is then maintained by Internet Explorer.
After some weeks / months / years… this list may become quite long.
Now, what can you do to remove just ONE or TWO sites from this allowed list?
Internet Explorer allows you to REMOVE ALL SITES… There is no button to remove just ONE site!
So, you remove all sites and start a new history again… More annoying than letting the add-on run on all sites… isn't it?
WHERE does IE keep this list? Mystery!
After a lot of searching, I could not find any information. Until…
Yes… it is in the registry (Please be extremely cautious when modifying this vital thing: The registry)
HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Ext\Stats\[add-on class ID]\iexplore\AllowedDomains
Example for Flash Player add-on:
When all sites are allowed:
When only some are:
Now, if we want to delete just ONE web site, simply delete its key.
Again: Please be extremely cautious when modifying this vital thing: The registry
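For those who prefer a script to regedit, the same deletion can be sketched in Python with the standard `winreg` module (Windows only). The function names are hypothetical, the class ID and site must be supplied by you, and the sketch assumes each allowed site is stored as an empty subkey of AllowedDomains, as described above. Export the key first if you want a way back.

```python
import sys

# Path given above, relative to HKEY_CURRENT_USER.
EXT_STATS = r"Software\Microsoft\Windows\CurrentVersion\Ext\Stats"


def allowed_domains_path(class_id):
    """Build the AllowedDomains key path for an add-on class ID (braces included)."""
    return r"{}\{}\iexplore\AllowedDomains".format(EXT_STATS, class_id)


def remove_allowed_site(class_id, site):
    """Delete the subkey of ONE allowed site, leaving the others untouched."""
    if sys.platform != "win32":
        raise OSError("the Windows registry is only available on Windows")
    import winreg  # Windows-only standard library module
    with winreg.OpenKey(winreg.HKEY_CURRENT_USER,
                        allowed_domains_path(class_id)) as parent:
        # DeleteKey fails if the subkey itself has subkeys.
        winreg.DeleteKey(parent, site)
```

A call would look like `remove_allowed_site("{add-on class ID}", "example.com")`, where both arguments are placeholders for your own values.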
Communication is a vast and ancient field which has gone through many phases of evolution, experimentation and research. The Internet, as we know it today, is one of the outcomes of this human work.
In this story, no single technology has proved eternal or everlasting. Each cycle of evolution produced forms of communication that were usable and convenient for their time. The Internet seems to be just one of these.
But now, each time I look at a web page, I feel that this just cannot last much longer (at least I hope so :)): these elastic regions, shapeless tables, unexpected font changes, images and colors… hazardous page reloads and other ‘partial updates’ (sometimes even more annoying)… the whole mess of plug-ins, add-ins and other artifacts…
That really doesn’t seem to be an ever-lasting model!
Searching for something on the Internet is even worse. Just search for the word ‘sequence’ and you will find yourself with an unclassified mess of subjects ranging from cinema to molecular biology!
Looking for a solution to a problem? You may find many, often ten or fifteen years old, which rarely relate to your current question.
To keep some ‘Lasting Value’ for old archived articles, many publishers no longer mention the articles’ publication dates… and search engines don’t help much in sorting search results by date. All these ‘partners’ (publishers / search engines) are happy with this. As long as the consumers (you and I) don’t complain, the business just continues to run with minimal costs!
The match between what is needed and what is offered seems to be near a breaking point.
In parallel, the great rise of technologies like web services and the wide range of their implementations may just allow us to hope for something new to emerge: new stable and appealing content explorers / new relevant and coherent search engines.