www.sixfingeredman.net
vector ir

Capitalized words should have slightly increased weight. This is a rough way of adding a spelling metric; ideally, words would not all be mutually orthogonal but would instead be separated according to their relatedness. Because capitalization sometimes carries meaning, but not always, we can't risk treating different capitalizations of a word as fully orthogonal. Changing the weight works, and since the "important" words are usually capitalized, it meshes reasonably well with the current use of weight to represent importance.

Each document vector encodes information not only about the document but also about the relatedness of words. You can estimate the relatedness of two words by averaging the difference in their components across all documents.

One way to improve matching is to perform comparisons at different levels. If you encode sentence vectors, these can be assembled into paragraph and document vectors, and document vectors can be grouped as well (say, to find the folder with the most documents pertaining to a query).
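A rough sketch of the ideas above, in Python. Everything named here is an assumption made for illustration, not something taken from the notes: the boost value of 1.5 (the notes only say "slightly increased"), the function names, the use of cosine similarity for comparison, and the reading of "averaging the difference" as a mean absolute difference. Case is folded so that "Paris" and "paris" share one dimension, and capitalization only changes the weight.

```python
from collections import Counter
import math

CAP_BOOST = 1.5  # assumed value; the notes only say "slightly increased"

def sentence_vector(sentence, cap_boost=CAP_BOOST):
    """Sparse term vector: case is folded, capitalized tokens weigh more."""
    vec = Counter()
    for token in sentence.split():
        word = token.strip(".,;:!?\"'()")
        if not word:
            continue
        weight = cap_boost if word[0].isupper() else 1.0
        vec[word.lower()] += weight
    return vec

def combine(vectors):
    """Sum child vectors into a parent vector.

    The same step takes sentences to a paragraph, paragraphs to a
    document, and documents to a folder.
    """
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds weights rather than replacing
    return total

def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def word_distance(w1, w2, doc_vectors):
    """Mean absolute difference of two words' components across documents.

    Lower values suggest the words are used in similar proportions,
    which is one possible reading of "relatedness" in the notes.
    """
    if not doc_vectors:
        return 0.0
    diffs = (abs(d.get(w1, 0.0) - d.get(w2, 0.0)) for d in doc_vectors)
    return sum(diffs) / len(doc_vectors)

def best_folder(query, folders):
    """Name of the folder whose combined vector best matches the query.

    `folders` maps a folder name to a list of document vectors.
    """
    q = sentence_vector(query)
    return max(folders, key=lambda name: cosine(q, combine(folders[name])))
```

One caveat with this sketch: sentence-initial words get the boost too, even when the capital letter carries no meaning. Since the boost is small and the dimensions are shared across capitalizations, that costs little, which is the same argument the notes make for adjusting weight instead of treating capitalizations as orthogonal.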
The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. -- Hans Reiser