www.sixfingeredman.net

vector ir

Capitalized words should get a slightly increased weight. This is a rough way of adding a spelling metric: ideally, words would not all be mutually orthogonal, but separated according to their relatedness. Because capitalization carries meaning only some of the time, we can't risk treating different capitalizations of a word as fully orthogonal. Adjusting the weight instead works, and since "important" words are usually capitalized, it meshes reasonably well with the current use of weight to represent importance.
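A minimal sketch of the capitalization idea, under some assumptions of mine: both capitalizations of a word share one dimension (the lowercased form), and a capitalized occurrence adds a little more weight than a lowercase one. The function name `term_vector`, the boost value of 1.25, and the simple regex tokenizer are illustrative choices, not anything fixed by the notes above.

```python
import re
from collections import Counter

# Assumed value; the notes only say "slightly increased".
CAP_BOOST = 1.25


def term_vector(text, cap_boost=CAP_BOOST):
    """Build a sparse term vector: lowercased word -> accumulated weight.

    "Apple" and "apple" land on the same axis, so they are never treated
    as orthogonal; the capitalized form just contributes more weight.
    """
    vec = Counter()
    for word in re.findall(r"[A-Za-z']+", text):
        weight = cap_boost if word[0].isupper() else 1.0
        vec[word.lower()] += weight
    return dict(vec)


if __name__ == "__main__":
    print(term_vector("Apple apple pie"))
```

Note that this sketch also boosts words that are capitalized only because they start a sentence, which is part of why the boost should stay small.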

Each document vector encodes information not only about the document, but also about the relatedness of words. You can estimate the relatedness of two words by averaging the difference between their components across all document vectors: the smaller the average difference, the more alike the two words are used.
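As a rough illustration of that averaging, here is one possible reading, assuming document vectors are stored as dicts from word to weight. I take the absolute difference so that positive and negative differences do not cancel out; that choice, and the name `relatedness_distance`, are my assumptions rather than part of the idea as stated.

```python
def relatedness_distance(word_a, word_b, doc_vectors):
    """Mean absolute difference between two words' components.

    doc_vectors is a list of dicts (word -> weight); a word missing from
    a document counts as weight 0. A smaller result means the two words
    are weighted more alike across the collection.
    """
    if not doc_vectors:
        raise ValueError("need at least one document vector")
    total = sum(
        abs(doc.get(word_a, 0.0) - doc.get(word_b, 0.0))
        for doc in doc_vectors
    )
    return total / len(doc_vectors)


if __name__ == "__main__":
    docs = [
        {"cat": 2, "dog": 2},
        {"cat": 1},
        {"tax": 3},
    ]
    print(relatedness_distance("cat", "dog", docs))
    print(relatedness_distance("cat", "tax", docs))
```

In the toy data, "cat" and "dog" come out closer than "cat" and "tax". One caveat of this reading: two rare words that almost never appear anywhere will also look close, so some normalization would likely be needed in practice.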

One way to improve matching is to perform comparisons at several levels. If you encode sentence vectors, these can be assembled into paragraph and document vectors, and document vectors can be grouped in turn. (Say you wanted to find the folder with the most documents pertaining to a query.)
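The folder example could be sketched roughly as follows. The assumptions here are mine: vectors are dicts from word to weight, a higher-level vector is simply the sum of its parts, "pertaining to a query" means cosine similarity at or above a threshold, and the names `combine`, `cosine`, and `best_folder` are made up for illustration.

```python
import math
from collections import Counter


def combine(vectors):
    """Assemble a higher-level vector by summing lower-level ones."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds weights per key
    return dict(total)


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_folder(query, folders, threshold=0.3):
    """Return (folder name, per-folder counts) for the best match.

    folders maps a name to a list of documents, and each document is a
    list of sentence vectors. The threshold is an assumed cutoff.
    """
    counts = {}
    for name, docs in folders.items():
        doc_vectors = [combine(sentences) for sentences in docs]
        counts[name] = sum(
            1 for doc in doc_vectors if cosine(query, doc) >= threshold
        )
    return max(counts, key=counts.get), counts


if __name__ == "__main__":
    folders = {
        "pets": [[{"cat": 1}, {"dog": 1}], [{"cat": 2, "food": 1}]],
        "finance": [[{"tax": 1}], [{"cat": 1, "tax": 5}]],
    }
    print(best_folder({"cat": 1}, folders))
```

Because the sentence vectors are kept, the same `cosine` call can also be run against individual sentences or paragraphs when a finer-grained match is wanted.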


The utility of an operating system is more proportional to the number of
connections possible between its components than it is to the number of
those components.
	-- Hans Reiser