Some undocumented PurpleSearch; some notes on the recommender
Some of the pages developed in the process of making PurpleSearch are not currently exposed to users, or aren't meant for users at all. Some of the more interesting ones include:
Manual database selection (by subject, by name), which was used when testing specific targets, but is not considered a feature and has been removed because in everyday use it doesn't have that much value over...
Category search, which uses the same categorization of searchable targets. It also includes links to websites that are not searchable by federated search, because there are important resources like this in various subject areas, in our case largely for law and business/economics.
These pages were made to some degree to act as department-specific start pages for search.
Phrase/string relations, which is something like a query analysis without a search. I mostly use it to see how well a string is known.
It also shows something I've played with on and off, namely guessing the subject areas a string/phrase is relevant to. It depends a lot on how well known the string and its relations are at the time of querying, but for the most part works pretty well -- better than I thought it would work, since I actually know how simple-and-stupid the code currently is :)
There is also an XML version that exposes the more basic data on this page, in more parsable form, which is being used by some systems elsewhere. Please do ask if you want to use this.
Topical words (livetrix)
A week ago I found some half-finished code that would estimate how related two words are. I finished it, and used it to back an experiment that guesses which general topic / field a word belongs to.
Take a look and play around. So far it seems to work better than I thought it would, considering it's not trained in any real machine learning sense.
At the time, the idea behind this code was to support potential features we discussed, such as matching user queries to sources by field, as a fallback for or augmentation of the recommender feature.
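To give an idea of how little is needed for this sort of thing, here is a minimal sketch. It is not the livetrix code. It assumes relatedness is estimated from how often two words turn up in the same documents (Jaccard overlap), and that a field is guessed by averaging a word's relatedness to a few seed words per field. The toy corpus, the seed words, and all names (DOCS, FIELD_SEEDS, relatedness, guess_field) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: each entry is the set of words in one document.
DOCS = [
    {"gene", "protein", "cell"},
    {"protein", "enzyme", "cell"},
    {"court", "statute", "appeal"},
    {"court", "appeal", "judge"},
    {"gene", "cell", "enzyme"},
]

# Hypothetical seed words per field.
FIELD_SEEDS = {
    "biology": ["cell", "protein"],
    "law": ["court", "judge"],
}

def build_index(docs):
    """Map each word to the set of document ids it appears in."""
    index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for word in words:
            index[word].add(doc_id)
    return index

def relatedness(index, a, b):
    """Jaccard overlap of the documents containing a and b (0.0 to 1.0)."""
    docs_a, docs_b = index.get(a, set()), index.get(b, set())
    union = docs_a | docs_b
    if not union:
        return 0.0
    return len(docs_a & docs_b) / len(union)

def guess_field(index, word, field_seeds):
    """Return (field, score) pairs, best first, by mean relatedness to seeds."""
    scores = {}
    for field, seeds in field_seeds.items():
        scores[field] = sum(relatedness(index, word, s) for s in seeds) / len(seeds)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

INDEX = build_index(DOCS)
```

Calling guess_field(INDEX, "enzyme", FIELD_SEEDS) ranks "biology" first in this toy data. With real data the interesting part is where the co-occurrence counts come from, which this sketch says nothing about.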
Back to work (livetrix)
I'm back to work on livetrix. The work I have to do for my thesis is still dividing my time, but such is life.
Let's see... I now have a separate virtualhost that I develop on, so you shouldn't see errors caused by syntax typos or refactoring.
I've started thinking about, fixing for, and experimenting with social bookmarking features.
I could argue about tags being a bit double-edged depending on how they're used, but I think I should be more immediately worried about the interface to organize bookmarks and/or to search for them (may be the same, may be different). It will probably have some lateral and/or fuzzy browsing, and that's the problem waiting to happen -- I've noticed that various sites in this area, like bibsonomy and PennTags, and even good ol' delicious, have good ideas but just end up too messy, usually because they try to do too much, or show too much at once. That may appeal to geeks, but less so to people for whom such an interface is the only choice, and for whom it is a potential turnoff.
Slow livetrix development for a bit
Just a note that plans (such as making the bookmarking more accessible and sharable, and more of the 2.0ish just-theories-at-this-point) are on hold for the moment, as I'm working on my thesis. The thesis wasn't getting done while I was trying to do both it and this development, you see...
I do tweak bits and bobs, but won't do any real development for at least the following month, or however long it takes me to finish this thesis work.
Book relations (livetrix)
The ISBN relation feature has now been polished enough to mention.
For an example, visit this bookmark and click 'Search for ISBN relations'. It will take a few seconds, then show a box for each related ISBN.
This example was chosen because our library has many of the related books. The ones we have are presented in a block before the ones we don't, and link to our OPC. This is useful for people who want to actually get these books from the library, since sorting through a huge pile of ISBNs would be no fun.
It takes a few seconds because the relation data comes from two sources, xISBN and ThingISBN.
ThingISBN is relatively instant because its data is stored in a local database, while the xISBN data is fetched via a web API - and will invisibly stop working once more than 500 queries are sent in a day. As a programmer I'm not a fan of this detail, but xISBN (at least as of right now) returns much more detailed data: the title and other details in the hover text come from it.
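One way to make that daily limit visible rather than invisible is to count remote calls locally and report when the remote source was skipped. This is a minimal sketch under assumptions, not the livetrix code: DailyQuota, related_isbns, and the two lookup callables are hypothetical names, and neither the real xISBN nor the real ThingISBN interface is modelled here.

```python
import datetime

class DailyQuota:
    """Counts calls per calendar day and refuses once the limit is reached."""

    def __init__(self, limit=500, today=datetime.date.today):
        self.limit = limit
        self._today = today      # injectable clock, handy for testing
        self._day = today()
        self._used = 0

    def try_acquire(self):
        """Return True and count the call if quota remains, else False."""
        now = self._today()
        if now != self._day:     # new day: reset the counter
            self._day = now
            self._used = 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True

def related_isbns(isbn, local_lookup, remote_lookup, quota):
    """Merge local and remote relations; report whether remote data was used.

    local_lookup and remote_lookup are hypothetical callables that take an
    ISBN string and return a list of related ISBN strings.
    """
    related = list(local_lookup(isbn))
    remote_used = quota.try_acquire()
    if remote_used:
        for other in remote_lookup(isbn):
            if other not in related:
                related.append(other)
    return related, remote_used
```

The second return value lets the page say "detailed data unavailable today" instead of silently showing less.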
Category experiment (livetrix)
Pages with English records should now have augmented guesses as to each record's subject, which should make skimming through results a little easier. (The green boxes; there were already grey boxes of prominent-and-interesting keywords, but these are now replaced by these subjects, when they can be guessed halfway sensibly.)
It should respond somewhat to varying specificity both on the page and in the records, though it is generally more likely to report the roughly represented fields. It's notably more detailed for biology and medicine, since the feature is based not only on Wikipedia data - which categorywise seems to specialize in music and sport - but also on MeSH.
It doesn't do so well when it doesn't have much data to work on. The best examples are currently the sources that provide rich metadata, specifically keywords, since they are likely to be controlled, clean, and hit on the phrase data used for this feature. There are enough sources that are minimal in terms of metadata, though, and there's quite logically less to go on (sometimes much less), and more guesswork is involved in picking terms that should be used to categorize.
Making it work halfway sensibly was a bit of a challenge, since without a good bit of weighting and filtering, you get a lot of nonsense suggestions, either from general words that by common sense don't really have a subject, or because a word has a skewedly strong relation to one thing - suddenly an article is about cricket because it has the words 'reverse' and 'approach' in it, about marvel comics because of words like 'shape' and 'vector,' or about knitting because of words like 'increase' and 'decrease.' Correct in its way, but not quite useful here :)
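To illustrate what weighting and filtering can mean here, a minimal sketch follows. It is not the actual code, and the relation strengths are invented. It assumes three things: words related to many subjects count for less, a single word's contribution is capped, and a subject is only reported when several words each support it noticeably. RELATIONS, guess_subjects, and all parameter names and defaults are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical relation data: word -> {subject: strength in 0..1}.
RELATIONS = {
    "reverse":  {"cricket": 0.9},
    "approach": {"cricket": 0.3, "management": 0.3, "medicine": 0.3, "aviation": 0.3},
    "insulin":  {"medicine": 0.8},
    "glucose":  {"medicine": 0.7, "chemistry": 0.4},
}
# Number of distinct subjects in the toy data above.
N_SUBJECTS = len({s for rel in RELATIONS.values() for s in rel})

def guess_subjects(words, relations=RELATIONS, n_subjects=N_SUBJECTS,
                   cap=0.5, min_contrib=0.2, min_support=2):
    """Score subjects for a bag of words, damping general and skewed words."""
    scores = defaultdict(float)
    support = defaultdict(int)
    for word in set(words):
        subjects = relations.get(word, {})
        if not subjects:
            continue
        # Words related to many subjects are general, so they count for less.
        specificity = max(
            0.0, math.log(n_subjects / len(subjects)) / math.log(n_subjects))
        for subject, strength in subjects.items():
            # Cap the contribution so one skewed relation cannot decide alone.
            contribution = min(strength * specificity, cap)
            scores[subject] += contribution
            if contribution >= min_contrib:
                support[subject] += 1
    # Filter: a subject needs noticeable evidence from several distinct words.
    kept = [(s, v) for s, v in scores.items() if support[s] >= min_support]
    return sorted(kept, key=lambda kv: kv[1], reverse=True)
```

With this toy data, 'reverse' and 'approach' on their own yield no subject at all, while adding 'insulin' and 'glucose' yields only medicine. The thresholds are arbitrary and would need tuning against real records.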
Reviews and RSS (livetrix)
I added the ability to add two bits of text onto each bookmark you keep: Notes, which only you see, and reviews, which everyone can see while browsing and searching bookmarks -- which I've yet to really implement as of this writing.
You can use basic HTML, and I've added basic hack-proofing, though I may need to move to BBCode since I've been asked to allow links, which in HTML form are potentially easy to abuse. It's amazing how many hacks are out there, and I don't want to be responsible for any sort of code injection.
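For reference, the usual shape of this kind of hack-proofing is a whitelist: re-emit a few harmless tags without their attributes and escape everything else. The sketch below uses only Python's standard html and html.parser modules. It is an illustration, not the filter used on this site; ALLOWED_TAGS, ReviewSanitizer, and sanitize are hypothetical names, and it makes no attempt to balance unclosed tags.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br"}  # assumed whitelist

class ReviewSanitizer(HTMLParser):
    """Re-emits whitelisted tags without attributes; escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append("<%s>" % tag)   # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag != "br":
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data))       # text never reaches output raw

def sanitize(text):
    """Return text with only whitelisted, attribute-free tags left in."""
    parser = ReviewSanitizer()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

Because links need an href attribute, allowing them would mean checking attribute values as well (at least the URL scheme), which is where the abuse potential comes in.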
Reviews should show up in search results, but I've yet to figure out how to do this properly. There are two main problems: records from different sources are not necessarily directly comparable even when they represent the same article or book, and it has to scale to easily allow matching thousands of reviews to a page of results.
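One possible direction, sketched under assumptions and not something that is implemented: reduce every record to a few normalized comparison keys (an ISBN stripped of punctuation, or a lowercased title plus year), index the reviews by those keys once, and look up each result in that index. The field names ("isbn", "title", "year", "record") and the function names are hypothetical, and the ISBN in the usage note is only an example value.

```python
import re
from collections import defaultdict

def match_keys(record):
    """Yield comparison keys for a record dict; field names are assumptions."""
    isbn = re.sub(r"[^0-9Xx]", "", record.get("isbn", "")).upper()
    if isbn:
        yield ("isbn", isbn)
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    if title and record.get("year"):
        yield ("title-year", title, str(record["year"]))

def build_review_index(reviews):
    """Map each key to the reviews filed under it; built once, reused per page."""
    index = defaultdict(list)
    for review in reviews:
        for key in match_keys(review["record"]):
            index[key].append(review)
    return index

def reviews_for(result, index):
    """Return reviews matching a result record, without duplicates."""
    found, seen = [], set()
    for key in match_keys(result):
        for review in index.get(key, []):
            if id(review) not in seen:
                seen.add(id(review))
                found.append(review)
    return found
```

A review filed under {"isbn": "0-306-40615-2", "title": "Some Book", "year": 1999} would then be found both for a result that only carries the ISBN as "0306406152" and for one that only has the title "some book." and the year. Each lookup is a couple of dictionary hits, so the number of reviews hardly matters. Titles that differ by more than case and punctuation would still be missed.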
Search for a perfect interface ....?
Livetrix : building around the (metalib)box
Digicmb's Guus van den Brekel presented our LiveTrix Metasearch Project at the Norwegian Knowledge Centre for the Health Services in Oslo and at the NTNU Library (UBiT) in Trondheim.
You can have a look at his excellent presentation here:
The workbench (livetrix)
Since I said I would, I made something clusty-like. That is, something that I originally imagined would be like it. It's a proof-of-concept feature that focuses on using metadata more than it does on clustering semantics.
It currently considers material type, keywords, years, authors, and database source. Each of those is a set of positive filters, nothing crazy. Deselect what you don't want to see, click a name to solo it (...which seemed handy in practice). See if you can make sense of it, and please complain if you can't, since that'd probably mean design flaws :)
You can try it by using the demo login (...and then remember you may not be the only one playing with it).
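The filter behaviour of the workbench (deselect what you don't want, click a name to solo it) can be sketched as below. This is a minimal illustration and not the workbench code: Facet and visible are hypothetical names, records are assumed to be plain dicts, and a record that lacks a field is assumed to pass that field's filter.

```python
class Facet:
    """One positive filter: a set of values, each of which can be switched off."""

    def __init__(self, name, values):
        self.name = name
        self.values = set(values)
        self.selected = set(values)   # everything is shown to begin with

    def deselect(self, value):
        self.selected.discard(value)

    def solo(self, value):
        """Show only this value, as when clicking its name."""
        if value in self.values:
            self.selected = {value}

    def accepts(self, record):
        value = record.get(self.name)
        if value is None:
            return True               # nothing to filter on
        values = value if isinstance(value, (list, tuple, set)) else [value]
        return any(v in self.selected for v in values)

def visible(records, facets):
    """Keep the records that pass every facet."""
    return [r for r in records if all(f.accepts(r) for f in facets)]
```

Multi-valued fields such as keywords and authors pass when at least one of their values is still selected, which is one reasonable reading of "positive filter".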
More than basic Latin scripts (livetrix)
I played with transliteration, which practically means that entering moskva will make the sounds-like feature suggest москва and vice versa (a source's response and possible transliteration is still up to that source, of course; I could augment queries with OR'd alternatives, but I'm wary of doing so since there are always some sources that respond badly).
I've played a little with other alphabets: Latin variants should already have started working, Greek can be and already is half added, Chinese is fairly impossible, the Japanese kana would be doable but the kanji less so, abjads like Hebrew and Arabic won't work well unless I can find a way to even just roughly phonetically add the implied vowels on the fly (though since it makes the sounds-like server a little slower I won't add the features unless there is demand) and abugidas are a little more hopeless yet.
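The moskva / москва round trip can be illustrated with a small table-driven sketch. It is a simplified illustration and not the sounds-like server's code. The table is deliberately incomplete and follows no particular romanization standard, and the names (LATIN_TO_CYRILLIC, to_cyrillic, to_latin) are hypothetical.

```python
# Simplified, incomplete Latin -> Cyrillic table (an assumption for illustration).
LATIN_TO_CYRILLIC = {
    "shch": "щ", "zh": "ж", "kh": "х", "ts": "ц", "ch": "ч", "sh": "ш",
    "yu": "ю", "ya": "я",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е", "z": "з",
    "i": "и", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о", "p": "п",
    "r": "р", "s": "с", "t": "т", "u": "у", "f": "ф", "y": "ы",
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}
MAX_KEY = max(len(key) for key in LATIN_TO_CYRILLIC)

def to_cyrillic(text):
    """Transliterate Latin text, trying the longest letter group first."""
    out, i = [], 0
    text = text.lower()
    while i < len(text):
        for size in range(MAX_KEY, 0, -1):
            chunk = text[i:i + size]
            if chunk in LATIN_TO_CYRILLIC:
                out.append(LATIN_TO_CYRILLIC[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])      # pass unknown characters through
            i += 1
    return "".join(out)

def to_latin(text):
    """Transliterate Cyrillic text one character at a time."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())
```

Trying the longest group first is what keeps "shch" from being read as "sh" plus "ch". Going back to Latin is the easy direction; going to Cyrillic is ambiguous in general, so a real feature would probably produce several candidates rather than one.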


RUG Combine & RUGlinks Weblog