Full-text retrieval fundamentals (3)

How to search the index

How do we find the result we want most, that is, the result most relevant to the query phrase?

1. First: the user inputs a query phrase

The basic query grammar supports operators such as “AND”, “OR”, and “NOT”, for example:
elephant AND tiger OR lion NOT sheep

2. Second: lexical analysis, syntax analysis, and language processing of the query phrase

Lexical analysis distinguishes ordinary words from keywords.
Syntax analysis builds a syntax tree according to the grammar rules.
Language processing is the same as the process used when building the index; see page 2.
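
As a rough illustration of the lexical analysis step, here is a minimal sketch in Python. The function name `lex`, the token kinds, and the choice to lowercase ordinary words as a stand-in for language processing are assumptions made for this example, not how any particular search library does it.

```python
import re

# Operators treated as keywords of the query grammar (assumed set).
KEYWORDS = {"AND", "OR", "NOT"}

def lex(query):
    """Split a query phrase into (kind, text) tokens."""
    tokens = []
    for word in re.findall(r"\w+", query):
        if word.upper() in KEYWORDS:
            # Keywords are matched case-insensitively and normalized to upper case.
            tokens.append(("KEYWORD", word.upper()))
        else:
            # Ordinary words become terms; lowercasing stands in for language processing.
            tokens.append(("TERM", word.lower()))
    return tokens

tokens = lex("elephant and tiger or lion not sheep")
for kind, text in tokens:
    print(kind, text)
```

A syntax analyzer would then take this token list and build the tree that the next step walks.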

3. Third: traverse the index and get the results that fit the syntax tree

First, find the documents that contain the words (elephant, tiger, lion) in the posting lists.
Second, merge those posting lists according to the AND and OR operators in the syntax tree.
Third, remove the documents that contain sheep. What remains is the set of documents that match elephant, tiger, or lion but not sheep.
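
The three steps above can be sketched with Python sets standing in for posting lists. The index contents are made up for illustration, and the grouping `((elephant AND tiger) OR lion) NOT sheep` is only one assumed reading of the query; the real grouping depends on the precedence rules of the query grammar.

```python
# Hypothetical inverted index: term -> set of document ids (the posting list).
index = {
    "elephant": {1, 2, 4},
    "tiger": {2, 4, 5},
    "lion": {3, 4},
    "sheep": {4, 5},
}

def postings(term):
    """Return the posting list for a term, or an empty set if the term is unknown."""
    return index.get(term, set())

# Steps 1 and 2: look up each term and merge the lists.
# Assumed reading of the query: ((elephant AND tiger) OR lion)
matched = (postings("elephant") & postings("tiger")) | postings("lion")

# Step 3: drop every document that contains "sheep".
result = matched - postings("sheep")

print(sorted(result))
```

Real engines keep posting lists sorted and merge them by walking the lists in order, but the set operations show the same logic.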

4. Fourth: sort the results by their relevance to the query phrase

How do we calculate the relevance between documents and the query phrase?
Treat the query phrase as a short document and score its relevance against each document; the higher the score, the higher the document ranks.
How do we score the relevance between documents?
It is not an easy thing.

Identify the important factors in the documents.
Examine the relationship between these factors.

The process of determining how important a word (term) is to a document is called the term weight process.
The process of judging the relationship between terms to obtain document relevance uses an algorithm called the Vector Space Model.

Term weight process

This is a simple, classic implementation; Lucene's differs slightly.

$w_{t,d} = tf_{t,d} \times \log(n/df_t)$

$w_{t,d}$ = the weight of term t in document d
$tf_{t,d}$ = the frequency of term t in document d
$n$ = total number of documents
$df_t$ = the number of documents that contain term t

Term Frequency (tf): how many times the term appears in the document. The larger tf is, the more important the term is.
Document Frequency (df): how many documents contain the term. The larger df is, the less important the term is.

It is like being a programmer: the more deeply you learn a technology, the better (large tf), and the fewer people who know that technology, the better (small df). You will then be highly competitive when looking for a job. A person's value lies in being irreplaceable.
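
The classic weight formula can be tried out with a small sketch. The tiny corpus `docs` and the function name `weight` are made up for this example, and the natural logarithm is assumed; as noted, Lucene's own scoring differs from this.

```python
import math

# Hypothetical corpus: document id -> list of terms after language processing.
docs = {
    "d1": ["elephant", "tiger", "elephant"],
    "d2": ["tiger", "lion"],
    "d3": ["lion", "sheep", "tiger"],
}

def weight(term, doc_id):
    """Classic weight of a term in a document: tf * log(n / df)."""
    tf = docs[doc_id].count(term)                             # term frequency in this document
    df = sum(1 for words in docs.values() if term in words)   # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    n = len(docs)                                             # total number of documents
    return tf * math.log(n / df)

for term in ("elephant", "tiger"):
    print(term, weight(term, "d1"))
```

Here "elephant" appears twice in d1 and in no other document, so it gets a positive weight, while "tiger" appears in every document, so log(n/df) is log(1) and its weight is zero.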

Vector Space Model

The smaller the angle between two vectors, the greater the relevance.
We take the cosine of the angle between the two vectors as the score.
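
As a minimal sketch of this scoring, each document and the query can be held as a mapping from term to weight and compared by cosine. The weights and document names below are made-up values for illustration, and `cosine` is a helper written for this example, not a library function.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

# The query is treated as a short document with its own term weights.
query = {"elephant": 1.0, "tiger": 1.0}
documents = {
    "doc1": {"elephant": 2.0, "tiger": 2.0},
    "doc2": {"lion": 3.0},
    "doc3": {"elephant": 1.0, "lion": 1.0},
}

# Rank documents from highest to lowest score.
ranked = sorted(documents, key=lambda name: cosine(query, documents[name]), reverse=True)
print(ranked)
```

A document pointing in the same direction as the query scores close to 1 regardless of its length, and a document sharing no terms with the query scores 0.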