Some undocumented PurpleSearch; some notes on the recommender
Some of the pages developed in the process of making PurpleSearch are not currently exposed to users, or aren't meant for users at all. Some of the more interesting ones include:
Manual database selection, useful for testing specific targets. It is not really considered a feature (and may be removed) because it has little added value over...
Category search, which uses the same categorization of searchable targets. It also includes links to websites that are not searchable by federated search, because there are important resources like this in various subject areas, in our case largely for law and business/economics.
These pages were made to some degree to act as department-specific start pages for search.
Phrase/string relations, which is something like a query analysis without a search. I mostly use it to see how well known a string is.
It also shows something I've played with on and off, namely guessing the subject areas a string/phrase is relevant to. It depends a lot on how well known the string and its relations are at the time of querying, but for the most part works pretty well -- better than I thought it would work, since I actually know how simple-and-stupid the code currently is :)
There is also an XML version that exposes the more basic data on this page, in more parsable form, which is being used by some systems elsewhere. Please do ask if you want to use this.
Codes and crosswalks (purplesearch)
We use data to look up Library of Congress Classification, Dewey, and also the smaller, local Dutch BCL (Basisclassificatie) codes, which we mostly use to present readable English forms of whatever codes appear in records.
The data that backs that is exposed in XML form. It took me some work to collect, and I figured I might save someone else the trouble. Note that BCL has relatively few codes at all, while the Dewey and LCC data here are not as specific as the actual systems go -- this contains only the more general classification (which seems to be the part that you don't have to buy to use).
More interesting than just the codes, though, is that we've been keeping track of co-occurring codes, with the idea that we can create crosswalk maps between codes, for various potential benefits in searching and browsing.
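To make the co-occurrence idea concrete, here is a minimal sketch of how counts could be turned into a crude crosswalk. This is not the actual code; the function names and the sample records are made up for illustration, and a real version would want some thresholding on the counts.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_crosswalk(records):
    """Count how often codes from different schemes appear on the same record.

    Each record is a set of (scheme, code) pairs (hypothetical input format).
    """
    cooc = defaultdict(Counter)
    for codes in records:
        for a, b in permutations(set(codes), 2):
            if a[0] != b[0]:  # only link codes across schemes
                cooc[a][b] += 1
    return cooc

def best_matches(cooc, code, scheme, n=3):
    """Return the n most frequently co-occurring codes in the target scheme."""
    ranked = [(other, count) for other, count in cooc[code].most_common()
              if other[0] == scheme]
    return ranked[:n]

# Made-up sample records, purely for illustration.
records = [
    {("lcc", "QA"), ("ddc", "510")},
    {("lcc", "QA"), ("ddc", "510"), ("bcl", "31")},
    {("lcc", "QA"), ("ddc", "004")},
]
cooc = build_crosswalk(records)
print(best_matches(cooc, ("lcc", "QA"), "ddc"))
```

The most frequent partner in the other scheme is then the first crosswalk candidate, with the count giving a rough idea of how much to trust it.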
Topical words (livetrix)
A week ago I found some half-finished code that would estimate how related two words are. I finished it, and used it to back an experiment that guesses which general topic / field a word belongs to.
Take a look and play around. So far it seems to work better than I thought it would, considering it's not trained in any real machine learning sense.
At the time, the idea behind this code was to support potential features we discussed, such as trying to match user queries to sources by field, as a fallback for or augmentation of the recommender feature.
Back to work (livetrix)
I'm back to work on livetrix. The work I have to do for my thesis is still dividing my time, but such is life.
Let's see... I now have a separate virtualhost that I develop on, so you shouldn't see errors caused by syntax typos or refactoring.
I've started thinking about, fixing for, and experimenting with social bookmarking features.
I could argue about the problems of tags being a bit double-edged depending on how they're used, but I think I should be more immediately worried about the interface to organize bookmarks and/or to search for them (may be the same, may be different). It will probably have some lateral and/or fuzzy browsing, but that's the problem waiting to happen -- I've noticed that various sites in this area, like bibsonomy and PennTags, and even good ol' delicious, have good ideas but just end up too messy, usually because they try to do too much, or show too much at once. That may appeal to geeks, but it is a potential turnoff for people for whom such an interface is the only choice.
Slow livetrix development for a bit
Just a note that plans (such as making the bookmarking more accessible and sharable, and more of the 2.0ish just-theories-at-this-point) are on hold for the moment, as I'm working on my thesis. The thesis wasn't getting done while I was trying to do both it and this development, you see...
I do tweak bits and bobs, but won't do any real development for at least the following month, or however long it takes me to finish this thesis work.
Book relations (livetrix)
The ISBN relation feature has now been polished enough to mention.
For an example, visit this bookmark and click 'Search for ISBN relations'. It will take a few seconds, then show a box for each related ISBN.
This example was chosen because our library has many of the related books. The ones we have are presented in a block before the ones we don't, and link to our OPAC. This is useful for people who want to actually get these books from the library, since sorting through a huge pile of ISBNs would be no fun.
It takes a few seconds because the relation data comes from two sources, xISBN and ThingISBN.
ThingISBN is relatively instant because this data is stored in a local database, while the xISBN data is fetched via a web API - and will invisibly stop working once more than 500 queries are sent in a day. As a programmer I'm not a fan of this detail, but xISBN (at least as of right now) returns much more detailed data: the title and other details in the hover text come from it.
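One way to keep that daily limit from failing invisibly is to count remote calls locally and fall back to the local data once the budget is used up. The sketch below only illustrates that idea; the class and function names are hypothetical, and the two lookup functions are stand-ins passed in by the caller rather than real xISBN or ThingISBN calls.

```python
import datetime

class DailyQuota:
    """Counts calls per calendar day and refuses once the limit is reached."""

    def __init__(self, limit, today=datetime.date.today):
        self.limit = limit
        self._today = today  # injectable clock, handy for testing
        self._day = today()
        self._used = 0

    def try_acquire(self):
        day = self._today()
        if day != self._day:  # new day: reset the counter
            self._day, self._used = day, 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True

def related_isbns(isbn, local_lookup, remote_lookup, quota):
    """Merge local and remote relations; skip the remote call when over quota.

    Returns (sorted related ISBNs, whether the remote source was used).
    """
    related = set(local_lookup(isbn))
    remote_used = quota.try_acquire()
    if remote_used:
        related |= set(remote_lookup(isbn))
    related.discard(isbn)  # a book is not its own relation
    return sorted(related), remote_used

# Stand-in lookups with placeholder identifiers.
quota = DailyQuota(limit=500)
found, used_remote = related_isbns(
    "isbn-a", lambda i: ["isbn-b"], lambda i: ["isbn-c", "isbn-a"], quota)
print(found, used_remote)
```

Returning the flag alongside the results means the page can at least say that the detailed data is unavailable for the rest of the day.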
Category experiment (livetrix)
Pages with English records should now have augmented guesses as to each record's subject, which should make skimming through results a little easier. (The green boxes; there were already grey boxes of prominent-and-interesting keywords, but these are now replaced by these subjects, when they can be guessed halfway sensibly.)
It should respond somewhat to varying specificity both on the page and in the records, though it is generally more likely to report the roughly represented fields. It's notably more detailed for biology and medicine since the feature is based not only on Wikipedia data - which categorywise seems to specialize in music and sport - but also on MeSH.
It doesn't do so well when it doesn't have much data to work on. The best examples are currently the sources that provide rich metadata, specifically keywords, since they are likely to be controlled, clean, and hit on the phrase data used for this feature. There are enough sources that are minimal in terms of metadata, though, and there's quite logically less to go on (sometimes much less), and more guesswork is involved in picking terms that should be used to categorize.
Making it work halfway sensibly was a bit of a challenge, since without a good bit of weighting and filtering, you get a lot of nonsense suggestions, either from general words that by common sense don't really have a subject, or from words that have a disproportionately strong relation to one thing - suddenly an article is about cricket because it has the words 'reverse' and 'approach' in it, about marvel comics because of words like 'shape' and 'vector,' or about knitting because of words like 'increase' and 'decrease.' Correct in its way, but not quite useful here :)
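As a rough illustration of what such weighting and filtering can look like (a sketch under my own assumptions, not the code behind the feature): drop words listed as too general, cap how much any single word may contribute to one subject, and only report subjects that more than one word supports. The names, thresholds, and sample relation data below are all invented.

```python
from collections import defaultdict

def guess_subjects(words, relations, general_words, max_share=0.5, min_support=2):
    """Rank subjects for a bag of words.

    relations maps word -> {subject: strength} (hypothetical data shape).
    """
    scores = defaultdict(float)
    support = defaultdict(set)
    for word in set(words):
        if word in general_words or word not in relations:
            continue  # filter: general or unknown words say nothing
        strengths = relations[word]
        total = sum(strengths.values())
        for subject, strength in strengths.items():
            # weight: cap a single word's pull towards one subject
            scores[subject] += min(strength / total, max_share)
            support[subject].add(word)
    ranked = [(subject, round(score, 3)) for subject, score in scores.items()
              if len(support[subject]) >= min_support]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

# Invented relation strengths, for illustration only.
relations = {
    "reverse": {"cricket": 9, "engineering": 1},
    "approach": {"cricket": 8, "methodology": 2},
    "enzyme": {"biochemistry": 5, "medicine": 5},
    "protein": {"biochemistry": 7, "medicine": 3},
}
general_words = {"approach"}
print(guess_subjects(["reverse", "approach", "enzyme", "protein"],
                     relations, general_words))
```

With this sample, cricket ends up supported by only one word and is dropped, while the two subjects backed by both 'enzyme' and 'protein' remain.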
Reviews and RSS (livetrix)
I added the ability to add two bits of text onto each bookmark you keep: Notes, which only you see, and reviews, which everyone can see while browsing and searching bookmarks -- which I've yet to really implement as of this writing.
You can use basic HTML, though I've added basic hack-proofing. I may need to move to BBCode since I've been asked to allow links, which in HTML form are potentially easy to abuse. It's amazing how many hacks are out there, and I don't want to be responsible for any sort of code injection.
Reviews should show up in search results, but I've yet to figure out how to do this properly. There are two main problems: records from different sources do not have to be directly comparable even when they represent the same article or book, and it has to scale to easily allow matching thousands of reviews to a page of results.
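For the curious, one simple way to do this sort of thing is to escape everything first and then re-enable a short whitelist of attribute-free tags. The snippet below is a minimal sketch of that approach, not the code used on the site; the tag list and function name are assumptions.

```python
import html
import re

# Assumed whitelist: simple formatting tags only, no attributes, no links.
ALLOWED_TAGS = ("b", "i", "em", "strong")

_ALLOWED_RE = re.compile(r"&lt;(/?)(%s)&gt;" % "|".join(ALLOWED_TAGS))

def sanitize(text):
    """Escape all markup, then restore only bare whitelisted tags."""
    escaped = html.escape(text)  # turns <, >, &, and quotes into entities
    return _ALLOWED_RE.sub(r"<\1\2>", escaped)

print(sanitize('<b>good</b> <script>alert(1)</script> <b onclick="x">no</b>'))
```

Because only the exact forms `<b>` and `</b>` (and so on) are restored, anything with attributes stays escaped, which is also why links cannot be allowed this way without more careful parsing.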
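One direction that might address both problems, sketched here only as an idea: reduce every record to a normalized match key (an identifier when there is one, otherwise a cleaned-up title plus year), index reviews by that key once, and then look up each result on a page with a single dictionary hit. The field names and the key format are assumptions, and title matching this crude will certainly miss some pairs.

```python
import re
from collections import defaultdict

def match_key(record):
    """Build a comparison key from whatever the record offers."""
    isbn = record.get("isbn")
    if isbn:
        # strip hyphens and spaces so differently formatted ISBNs compare equal
        return "isbn:" + re.sub(r"[^0-9Xx]", "", isbn).upper()
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    return "title:{}:{}".format(title, record.get("year", ""))

def index_reviews(reviews):
    """Group review texts by the match key of the record they were written for."""
    index = defaultdict(list)
    for review in reviews:
        index[match_key(review["record"])].append(review["text"])
    return index

def attach_reviews(results, index):
    """Pair each search result with its reviews; one lookup per result."""
    return [(result, index.get(match_key(result), [])) for result in results]

reviews = [
    {"record": {"isbn": "0-306-40615-2"}, "text": "Useful overview."},
    {"record": {"title": "Some Article: A Study", "year": 2006}, "text": "Dense."},
]
index = index_reviews(reviews)
page = [{"isbn": "0306406152"}, {"title": "some article - a study", "year": 2006}]
for result, texts in attach_reviews(page, index):
    print(result, texts)
```

In a database this would be an indexed key column rather than an in-memory dictionary, but the cost per page of results stays proportional to the number of results, not the number of reviews.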
Search for a perfect interface ....?
Livetrix : building around the (metalib)box
Digicmb's Guus van den Brekel presented our LiveTrix Metasearch Project at the Norwegian Knowledge Centre for the Health Services in Oslo and at the NTNU Library (UBiT) in Trondheim.
You can have a look at his excellent presentation here:
The workbench (livetrix)
Since I said I would, I made something clusty-like. That is, something that I originally imagined would be like it. It's a proof-of-concept feature that focuses on using metadata more than it does on clustering semantics.
It currently considers material type, keywords, years, authors, and database source. Each of those is a set of positive filters, nothing crazy. Deselect what you don't want to see, click a name to solo it (...which seemed handy in practice). See if you can make sense of it, and please complain if you can't, since that'd probably mean design flaws :)
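The filter logic itself is about as simple as it sounds. A minimal sketch, assuming one value per facet per record (real keywords and authors are multi-valued) and with made-up field names and records:

```python
FACETS = ("type", "year", "author", "source")

def all_selected(records):
    """Start with every value that occurs selected, per facet."""
    return {facet: {record[facet] for record in records} for facet in FACETS}

def deselect(selection, facet, value):
    """Return a copy of the selection with one value switched off."""
    new = {f: set(values) for f, values in selection.items()}
    new[facet].discard(value)
    return new

def solo(selection, facet, value):
    """Return a copy of the selection with only this value on in its facet."""
    new = {f: set(values) for f, values in selection.items()}
    new[facet] = {value}
    return new

def apply_filters(records, selection):
    """A record is shown only if its value is selected in every facet."""
    return [record for record in records
            if all(record[facet] in selection[facet] for facet in FACETS)]

# Made-up records, for illustration.
records = [
    {"type": "article", "year": 2005, "author": "A", "source": "db1"},
    {"type": "book", "year": 2005, "author": "B", "source": "db2"},
    {"type": "article", "year": 2006, "author": "B", "source": "db2"},
]
selection = all_selected(records)
print(len(apply_filters(records, solo(selection, "type", "article"))))
print(len(apply_filters(records, deselect(selection, "source", "db2"))))
```

Since facets combine with AND and values within a facet with OR, soloing is just replacing one facet's set with a single value.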
You can try it by using the demo login (...and then remember you may not be the only one playing with it).


RUG Combine & RUGlinks Weblog