Ideas for mapping any $searchterms to a given set of $results (stemming, spelling correction, etc.) with AS?


We have an application that handles a fairly broad range of free-form user input. Analysis showed that we lose quite a few users because their input does not match what we have prepared in our dataset. Users type in ‘cars’, ‘car’, ‘cars driving’, ‘carrs’ and so on, while we only have ‘car’ prepared.

I wonder if anybody has already worked on something that takes user-provided search terms and maps each of them to a single, best-matching result record (prepared results).

Think of it like the ‘Google input’ problem. They have tens of millions of indexes, but not an infinite number of them. However, they can still return results for any input you give them, as long as it has at least some similarity to something that exists.

This is a kinda tough problem to model in a NoSQL datastore, so it might be valuable to have a public discussion about how this might be implemented (especially with AS, but if somebody implemented something using Redis, that would be interesting too, I guess…)

So, when splitting this problem up into smaller, more solvable parts, one ends up thinking about ‘stemming’ (removing plurals, bringing verbs/nouns to a common ‘base’), spelling correction, and word order ‘normalization’(?). The first one is kinda easy to solve with the Porter stemming algorithm, the second one not so much (especially if you don’t have a db of correct variants of a word before stemming).
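To make the three sub-problems concrete, here is a minimal sketch in plain Python, not tied to AS or any other datastore. Everything in it is an assumption for illustration: `PREPARED` is a hypothetical in-memory stand-in for the prepared result records, `crude_stem` is a deliberately naive plural stripper rather than a real Porter stemmer, and spelling correction is approximated with the standard library's `difflib.get_close_matches` against the vocabulary of prepared keys.

```python
import difflib
import re

# Hypothetical prepared records: normalized key -> record id.
PREPARED = {"car": "rec:car", "house": "rec:house", "train": "rec:train"}


def crude_stem(word):
    """Naive plural stripping; a stand-in for a real Porter stemmer."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def correct(word, vocabulary, cutoff=0.8):
    """Snap a word to the closest known vocabulary word, if one is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def normalize(query, vocabulary):
    """Lowercase, tokenize, stem, spell-correct, then sort to ignore word order."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    stems = {correct(crude_stem(token), vocabulary) for token in tokens}
    return sorted(stems)


def lookup(query, prepared):
    """Return the record for the best-matching prepared key, or None."""
    vocabulary = sorted({word for key in prepared for word in key.split()})
    stems = normalize(query, vocabulary)
    full_key = " ".join(stems)
    if full_key in prepared:
        return prepared[full_key]
    # Fall back to the first single term that has a prepared record.
    for stem in stems:
        if stem in prepared:
            return prepared[stem]
    return None  # this is where a fallback such as Solr could take over


for query in ["cars", "car", "cars driving", "carrs"]:
    print(query, "->", lookup(query, PREPARED))  # each maps to rec:car
```

With this approach only the normalized keys need to live in the key-value store, so every lookup stays an exact-match read; the fuzzy part happens in the application before the read. The spelling step does depend on having a vocabulary of known-good words, which is exactly the hard part mentioned above.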

Would you try to implement those things with AS, or jump to a fallback solution like Apache Solr in case no entry could be found?

Of course there are alternative/simpler approaches to this problem, like offering auto-suggest, asking users to re-check their queries and so on, but since we are really concerned about usability - and about space requirements for our huge result records - we would like to eliminate as much trouble as possible. Since we haven’t really decided on algorithms yet, we haven’t spent a single second on what data model to use within AS, or on whether we should use AS for this at all.

Any input welcome.