- Index
- In Lucene, an Index lives in a single directory.
- All the files in that directory together constitute the Index.
- Segment
- An Index can contain many Segments, and each Segment is independent.
- Newly added Documents are built into a new Segment, and separate Segments can later be merged.
- Files that share a prefix belong to the same Segment, for example “_0”, “_1”, “_2”.
- segments.gen and segments_X hold the Segment metadata, that is, information about the Segments’ properties.
- Document
- A Document is the basic unit for building an Index. Different Documents may be stored in different Segments, and one Segment can contain many Documents.
- A newly added Document goes into a newly built Segment; when Segments are merged, Documents from different Segments end up in the same Segment.
- Field
- A Document may contain several kinds of information, such as time, content, and author. Each kind can be indexed separately and is stored in its own Field.
- Different Fields can be indexed in different ways; this is explained when the storage of Fields is analyzed.
- Term
- A Term is the basic unit of the Index. It is a string produced by lexical analysis and language processing.
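
The rule that files sharing a prefix belong to the same Segment can be illustrated with a small sketch. The function name `group_by_segment` and the file listing are hypothetical, and the sketch assumes that segment names consist of an underscore followed by lowercase letters or digits; it only inspects file names and does not read a real index.

```python
import re
from collections import defaultdict

# Assumption: a segment name is "_" followed by lowercase letters/digits,
# e.g. "_0" or "_1a". Anything else is treated as an index-level file.
SEGMENT_PREFIX = re.compile(r"_[0-9a-z]+")

def group_by_segment(filenames):
    """Split file names into per-segment groups and index-level files."""
    segments = defaultdict(list)
    index_level = []
    for name in filenames:
        match = SEGMENT_PREFIX.match(name)
        if match:
            # Files with the same prefix belong to the same Segment.
            segments[match.group(0)].append(name)
        else:
            # segments_N, segments.gen and write.lock describe the whole Index.
            index_level.append(name)
    return dict(segments), index_level

# Hypothetical directory listing, for illustration only.
files = ["segments_2", "segments.gen", "write.lock",
         "_0.fnm", "_0.fdt", "_0.fdx", "_1.cfs"]
segments, index_level = group_by_segment(files)
print(segments)
print(index_level)
```

With this listing, the three `_0.*` files are grouped under `_0`, the compound file is grouped under `_1`, and the remaining three files are reported as index-level.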
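
Index files, using the extensions of the Lucene 4.x file formats:

Name | Extension | Description |
---|---|---|
Segments File | segments_N | Records how many segments the index contains and how many documents each segment contains. |
Segment Metadata | .si | Stores metadata about a segment. |
Lock File | write.lock | Prevents multiple IndexWriters from writing to the same index files at the same time. |
Compound File | .cfs, .cfe | Stores all the index information in a single compound file. |
Fields | .fnm | Stores the fields contained in this segment, along with each field's name and index type. |
Stored Fields | .fdx, .fdt | Stores the documents contained in this segment, the fields in each document, and the information in each field. |
Term Dictionary | .tim, .tip | The .tim file stores the term statistics for each field and the pointers into the .doc, .pos, and .pay files. The .tip file stores the index into the term dictionary, which allows random access. |
Term Frequencies and Skip Data | .doc | Stores the term frequency information for each document in this segment. |
Term Positions | .pos | Stores the term position information for each document in this segment. |
Payloads and Offsets | .pay | Stores the payloads and term offsets for each document in this segment. Part of the term position information is stored in the .pos file. |
Norms | .nvd, .nvm | The .nvm file stores the metadata for the per-field weighting factors; the .nvd file stores the weighting data itself. |
Per-Document Values | .dvd, .dvm | The .dvm file stores the metadata for the per-document weighting factors; the .dvd file stores the weighting data itself. |
Term Vectors | .tvx, .tvd, .tvf | The .tvd file stores the terms, term frequencies, positions, payloads, and related information for the documents in this segment. The .tvx file is an index used to load a specific document into memory. The .tvf file stores the field-level term vector information. |
Live Documents | .liv | Stores information about which documents are still live (not deleted). |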

Index files, using the older extensions of the Lucene 3.x file formats:

Name | Extension | Brief Description |
---|---|---|
Segments File | segments.gen, segments_N | Stores information about segments |
Lock File | write.lock | The Write lock prevents multiple IndexWriters from writing to the same file. |
Compound File | .cfs | An optional “virtual” file consisting of all the other index files for systems that frequently run out of file handles. |
Fields | .fnm | Stores information about the fields |
Field Index | .fdx | Contains pointers to field data |
Field Data | .fdt | The stored fields for documents |
Term Infos | .tis | Part of the term dictionary, stores term info |
Term Info Index | .tii | The index into the Term Infos file |
Frequencies | .frq | Contains the list of docs which contain each term along with frequency |
Positions | .prx | Stores position information about where a term occurs in the index |
Norms | .nrm | Encodes length and boost factors for docs and fields |
Term Vector Index | .tvx | Stores offset into the document data file |
Term Vector Documents | .tvd | Contains information about each document that has term vectors |
Term Vector Fields | .tvf | The field level info about term vectors |
Deleted Documents | .del | Info about which documents are deleted |
Lucene's index stores not only forward mappings but also inverted (reverse) mappings.
Forward mapping
- From Index down to Term: Index -> Segment -> Document -> Field -> Term
- Each level stores metadata about the level below it, much as a province, a city, and a county each keep records about the units they contain.
- segments_N : how many Segments the Index has and how many Documents each Segment has.
- .fnm : how many Fields the Segment contains, plus each Field's name and indexing options.
- .fdx , .fdt : all Documents in the Segment, how many Fields each Document has, and what information each Field records.
- .tvx , .tvd , .tvf : how many Documents the Segment has, how many Fields each Document has, how many terms each Field has, and each term's string, positions, and so on.
Inverted mapping
- Term -> Document
- .tis , .tii : the term dictionary, that is, all terms in the Segment sorted in alphabetical order.
- .frq : the postings lists, that is, for each term the sorted list of IDs of the Documents that contain it, along with the term's frequency in each.
- .prx : the positions at which each term occurs within each Document of its postings list.
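
To make the Term -> Document direction concrete, here is a minimal in-memory sketch of an inverted index. It is not how Lucene lays out its files: `tokenize` is a hypothetical stand-in for Lucene's analysis step, the sample documents are invented, and the comments only indicate which file plays a comparable role.

```python
from collections import defaultdict

def tokenize(text):
    # Hypothetical stand-in for lexical analysis: lowercase, split on whitespace.
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}; list position is the doc ID."""
    postings = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for position, term in enumerate(tokenize(text)):
            postings[term].setdefault(doc_id, []).append(position)
    # Keep terms in sorted order, like a term dictionary (.tis / .tii role).
    return {term: postings[term] for term in sorted(postings)}

# Invented sample documents, for illustration only.
docs = ["Lucene stores an index",
        "an index has segments",
        "segments hold documents"]
index = build_inverted_index(docs)

term_dictionary = list(index)                            # sorted terms
doc_ids = sorted(index["index"])                         # .frq role: doc IDs
freqs = {d: len(p) for d, p in index["index"].items()}   # .frq role: frequencies
positions = index["segments"]                            # .prx role: positions
print(term_dictionary, doc_ids, freqs, positions)
```

Looking up a term goes straight to the documents that contain it, which is what the forward files alone cannot do without scanning every Document.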
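Primitive types
- Byte : the most basic type, 8 bits long.
- UInt32 : composed of 4 Bytes.
- UInt64 : composed of 8 Bytes.
- VInt :
- A variable-length integer composed of one or more Bytes.
- The high bit of each Byte flags whether another Byte follows (1 means more Bytes follow, 0 marks the last Byte); the remaining 7 bits carry the value.
- Earlier Bytes hold the lower-order bits.
- For example, 51271 is encoded as [1]1000111, [1]0010000, [0]0000011.
- Chars : UTF-8 encoded bytes.
- String : a VInt giving the number of Chars, followed by the Chars themselves.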
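
The VInt and String layouts can be sketched in a few lines. The helper names `write_vint`, `read_vint`, and `write_string` are hypothetical, and the String helper assumes the length prefix counts UTF-8 bytes, which coincides with the number of characters for ASCII text.

```python
def write_vint(value):
    """Encode a non-negative integer as a VInt, low-order 7 bits first."""
    if value < 0:
        raise ValueError("VInt cannot encode negative numbers")
    out = bytearray()
    while value > 0x7F:
        # Low 7 bits of the value, with the high bit set: more bytes follow.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # last byte: high bit is 0
    return bytes(out)

def read_vint(data, offset=0):
    """Decode a VInt starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, offset
        shift += 7

def write_string(text):
    # Assumption: the length prefix is the number of UTF-8 bytes.
    encoded = text.encode("utf-8")
    return write_vint(len(encoded)) + encoded

encoded = write_vint(51271)
print([format(b, "08b") for b in encoded])
print(read_vint(encoded))
print(write_string("abc"))
```

For 51271 the three bytes produced are 11000111, 10010000, and 00000011, matching the example in the list above.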