Full-text retrieval fundamentals (2)

How to build an index

1. Prepare the original documents

  • file1: Researches of Chinese full-text search technologies based on word indexing is related to many fields.
  • file2: The Index Data Service provides basic full-text functions for storage and retrieval of terms and indexed summary documents.

2. Pass the documents to the TOKENIZER

  1. Split each document into words
  2. Remove punctuation
  3. Remove stop words

    English stop words include “like”, “a”, “this”…
    The tokenizer's output is a list of tokens:
    “Researches” “Chinese” “full” “text” “search” “technologies” “word” “indexing” “related” “many” “fields” “Index” “Data” “Service” “provides” “basic” “full” “text” “functions” “storage” “retrieval” “terms” “indexed” “summary” “documents”
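
The three tokenizer steps can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine implements its tokenizer: the `STOP_WORDS` set and the `tokenize` function are assumptions made for this example, and real analyzers ship much longer stop-word lists.

```python
import re

# Assumed stop-word list, just large enough for the sample documents.
STOP_WORDS = {"a", "and", "for", "is", "like", "of", "on", "the", "this", "to"}

def tokenize(text):
    """Split text into words, drop punctuation, then drop stop words."""
    # Keep runs of letters only, so hyphens and periods act as separators.
    words = re.findall(r"[A-Za-z]+", text)
    # Stop words are compared case-insensitively; tokens keep their original case.
    return [w for w in words if w.lower() not in STOP_WORDS]

file2 = ("The Index Data Service provides basic full-text functions for "
         "storage and retrieval of terms and indexed summary documents.")
print(tokenize(file2))
```

For file2 this yields the same tokens as the second half of the list above, from “Index” through “documents”.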

3. Pass the tokens to the LINGUISTIC PROCESSOR

  1. Convert to lowercase
  2. Reduce words to their root form, e.g. “fields” to “field” (stemming)
  3. Convert words to their dictionary form, e.g. “indexed” to “index” (lemmatization)

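    The difference between stemming and lemmatization

    • In common: both reduce a word to a base form
    • Difference in approach:

      • Stemming cuts characters off the word
      • Lemmatization maps the word to a different form
    • Difference in algorithm:

      • Stemming applies fixed rules: remove a trailing “s”, “ing” -> “e”, “ational” -> “ate”, “tional” -> “tion”
      • Lemmatization looks the word up in a dictionary: “drove” -> “drive”
    • The two are not mutually exclusive; they complement each other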

    The linguistic processor's output is called terms:
    “researche” “chinese” “full” “text” “search” “technologie” “word” “index” “relate” “many” “field” “index” “data” “service” “provide” “basic” “full” “text” “function” “storage” “retrieve” “term” “index” “summary” “document”

    Thanks to the linguistic processor, a search for “drove” also finds documents that contain “drive”.
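
A toy version of the linguistic processor might look like the sketch below. The `LEMMAS` table and the single "strip a trailing s" rule are assumptions chosen so that the sample tokens come out as the terms listed above; production systems typically use a full rule-based stemmer and a dictionary-backed lemmatizer instead.

```python
# Hypothetical lemma dictionary: irregular forms mapped to their dictionary form.
LEMMAS = {
    "indexing": "index", "indexed": "index", "related": "relate",
    "retrieval": "retrieve", "drove": "drive", "driven": "drive",
    "driving": "drive",
}

def to_term(token):
    """Lowercase a token, then lemmatize it or fall back to a crude stemming rule."""
    word = token.lower()
    if word in LEMMAS:                      # lemmatization: dictionary lookup
        return LEMMAS[word]
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                    # stemming: cut off a trailing "s"
    return word

tokens = ["Index", "Data", "Service", "provides", "basic", "full", "text",
          "functions", "storage", "retrieval", "terms", "indexed", "summary",
          "documents"]
print([to_term(t) for t in tokens])
```

The crude rule also explains terms such as “researche” and “technologie”: only the final “s” is removed, so the result is not always a real word. That is acceptable as long as indexing and searching apply the same rule.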

4. Pass the terms to the INDEXER

  1. Build a dictionary from the terms
Term Document ID
researche 1
chinese 1
full 1
text 1
search 1
technologie 1
word 1
index 1
relate 1
many 1
field 1
index 2
data 2
service 2
provide 2
basic 2
full 2
text 2
function 2
storage 2
retrieve 2
term 2
index 2
summary 2
document 2
  2. Sort the dictionary alphabetically by term
Term Document ID
basic 2
chinese 1
data 2
document 2
field 1
full 1
full 2
function 2
index 1
index 2
index 2
many 1
provide 2
relate 1
researche 1
retrieve 2
search 1
service 2
storage 2
summary 2
technologie 1
term 2
text 1
text 2
word 1
  3. Merge identical terms into posting lists
    Term-[Document Frequency] | DocumentID-Frequency
    basic-1 2-1
    chinese-1 1-1
    data-1 2-1
    document-1 2-1
    field-1 1-1
    full-2 1-1 -> 2-1
    function-1 2-1
    index-2 1-1 -> 2-2
    many-1 1-1
    provide-1 2-1
    relate-1 1-1
    researche-1 1-1
    retrieve-1 2-1
    search-1 1-1
    service-1 2-1
    storage-1 2-1
    summary-1 2-1
    technologie-1 1-1
    term-1 2-1
    text-2 1-1 -> 2-1
    word-1 1-1
  • Document Frequency: the number of documents that contain the term
  • Frequency: the number of times the term appears in that document
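
The three indexer steps (build the dictionary, sort it, merge into posting lists) can be sketched as follows. The function name `build_index` and the in-memory dict layout are assumptions for illustration; a real indexer writes a compact on-disk structure instead.

```python
from collections import Counter

def build_index(docs):
    """docs maps document ID -> list of terms.
    Returns term -> (document frequency, [(document ID, frequency), ...])."""
    # Step 1: dictionary of (term, document ID) pairs.
    pairs = [(term, doc_id) for doc_id, terms in docs.items() for term in terms]
    # Step 2: sort by term, then by document ID.
    pairs.sort()
    # Step 3: merge identical pairs; the count is the frequency in that document.
    postings = {}
    for (term, doc_id), freq in Counter(pairs).items():
        postings.setdefault(term, []).append((doc_id, freq))
    # Document frequency is the length of the posting list.
    return {term: (len(plist), plist) for term, plist in postings.items()}

docs = {
    1: ["researche", "chinese", "full", "text", "search", "technologie",
        "word", "index", "relate", "many", "field"],
    2: ["index", "data", "service", "provide", "basic", "full", "text",
        "function", "storage", "retrieve", "term", "index", "summary",
        "document"],
}
index = build_index(docs)
for term, (df, plist) in index.items():
    print(f"{term}-{df}", " -> ".join(f"{d}-{f}" for d, f in plist))
```

Because the pairs are sorted before merging, each posting list comes out ordered by document ID, which is what makes merging two lists efficient at query time.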

At search time, the query words “drive”, “driving”, “drove” and “driven” go through the same linguistic processing as during index building, so all of them become “drive”.
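
A minimal sketch of that query-side normalization is shown below. The `LEMMAS` table, the `INDEX` contents and the document IDs are made-up illustrative data, not taken from the sample documents above.

```python
# Hypothetical lemma table and index, just large enough for this example.
LEMMAS = {"driving": "drive", "drove": "drive", "driven": "drive"}
INDEX = {"drive": [1, 3], "car": [2, 3]}  # term -> IDs of documents containing it

def normalize(word):
    """Apply the same processing used at index time: lowercase, then lemmatize."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)

def search(query):
    """Return the posting list of the normalized query term, or an empty list."""
    return INDEX.get(normalize(query), [])

for query in ["drive", "Driving", "drove", "driven"]:
    print(query, "->", search(query))
```

All four queries resolve to the same posting list because they are normalized to the same term before the lookup.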