Tech News

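Sometimes a plain Ctrl-F text-match search over a collection of articles cannot find an exact match. In that case, we can try quickly building our own Chinese search engine and see whether it makes the material easier to find. A Chinese search engine can in fact be built with the free edition of Elasticsearch. Let's see how to get a lab up and running quickly.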

Download and run Elasticsearch with Docker

docker run -p 127.0.0.1:9200:9200 -d --name elasticsearch \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "xpack.license.self_generated.type=basic" \
  -v "elasticsearch-data:/usr/share/elasticsearch/data" \
  docker.elastic.co/elasticsearch/elasticsearch:8.17.0
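
The container can take a little while before it accepts requests. As a rough sketch, we could poll the port published above until it answers. The function name wait_for_es, the retry count and the sleep interval are all assumptions made for illustration, not part of Elasticsearch.

```shell
# Hypothetical helper: poll an HTTP endpoint until it answers or attempts run out.
# Assumes security is disabled (as in the docker command above), so plain HTTP works.
wait_for_es() {
  url="${1:-http://localhost:9200}"   # endpoint to poll
  attempts="${2:-30}"                 # maximum number of tries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors; -s and -o /dev/null keep it quiet
    if curl -s -f -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}

# Usage:
#   wait_for_es && echo "Elasticsearch is up"
```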

Create the database. In Elasticsearch a database is called an index. While creating it, we also define our own n-gram analyzer and tokenizer.

curl -X PUT "localhost:9200/book-ngram?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "max_ngram_diff": 4
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}'
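
To see what the analyzer actually produces, we can pass a sample string to the _analyze API of the new index. The sketch below wraps that call in two small functions. The names analyze_body and analyze are hypothetical, and the body builder does no JSON escaping, so it assumes the text contains no double quotes or backslashes. For 紅樓夢 the response should list every substring of one to three characters (紅, 紅樓, 紅樓夢, 樓, and so on).

```shell
# Hypothetical helper: print the request body for the _analyze API.
# Assumption: the text contains no double quotes or backslashes (no JSON escaping here).
analyze_body() {
  printf '{"analyzer":"my_analyzer","text":"%s"}\n' "$1"
}

# Send the body to the book-ngram index (requires the container to be running).
analyze() {
  analyze_body "$1" | curl -s -X POST "localhost:9200/book-ngram/_analyze?pretty" \
    -H 'Content-Type: application/json' --data-binary @-
}

# Usage:
#   analyze "紅樓夢"
```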

Assume each record has three fields: record_id, title and content, where title and content hold Chinese text. Both of them use the n-gram analyzer.

curl -X PUT "localhost:9200/book-ngram/_mapping?pretty" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    },
    "content": {
      "type": "text",
      "analyzer": "my_analyzer",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    },
    "record_id": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    }
  }
}'

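Bulk-upload the content. The bulk API expects newline-delimited JSON: each action line and each document sits on its own line, and the body must end with a newline. (To upload a JSON file instead, replace -d'xxx' with --data-binary @FILENAME.)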

curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "book-ngram" } }
{"record_id":"1","title":"紅樓夢","content":"甄士隱夢幻識通靈賈雨村風塵懷閨秀"}
{ "index" : { "_index" : "book-ngram" } }
{"record_id":"2","title":"西遊記","content":"混沌未分天地亂,茫茫渺渺無人見。自從盤古破鴻蒙,開闢從茲清濁辨。覆載群生仰至仁,發明萬物皆成善。"}
{ "index" : { "_index" : "book-ngram" } }
{"record_id":"3","title":"水滸傳","content":"張天師祈禳瘟疫洪太尉誤走妖魔"}
'
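
For more than a handful of records, typing the bulk body by hand gets tedious. As a minimal sketch, assuming the records live in a tab-separated file with the columns record_id, title and content, the body could be generated with awk. The function name tsv_to_bulk and the file names books.tsv and bulk.ndjson are hypothetical. No JSON escaping is done, so the fields must not contain double quotes, backslashes or tabs.

```shell
# Hypothetical helper: turn tab-separated lines (record_id, title, content)
# into the newline-delimited JSON body expected by the _bulk API.
# Assumption: fields contain no double quotes, backslashes or tabs,
# because no JSON escaping is done here.
tsv_to_bulk() {
  awk -F'\t' '{
    # action line: index the next document into book-ngram
    print "{ \"index\" : { \"_index\" : \"book-ngram\" } }"
    # document line
    printf "{\"record_id\":\"%s\",\"title\":\"%s\",\"content\":\"%s\"}\n", $1, $2, $3
  }' "$1"
}

# Usage:
#   tsv_to_bulk books.tsv > bulk.ndjson
#   curl -X POST "localhost:9200/_bulk?pretty" \
#     -H 'Content-Type: application/x-ndjson' --data-binary @bulk.ndjson
```

--data-binary is used rather than -d because -d strips newlines when it reads from a file, and the bulk API needs them.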

Search across multiple fields, giving title twice the weight of content.

curl -X GET "localhost:9200/book-ngram/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "multi_match": {
      "query": "開天闢地",
      "fields": ["title^2", "content"],
      "analyzer": "my_analyzer"
    }
  }
}'
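
To try different keywords without retyping the whole request, the same query could be wrapped in a small function. This is only a sketch: the names search_body and search_books are hypothetical, and the keyword is assumed to contain no double quotes or backslashes. The filter_path parameter trims the response down to each hit's score and title, which makes the effect of the title weight easier to compare.

```shell
# Hypothetical helper: print the multi_match query body for a keyword.
# Assumption: the keyword contains no double quotes or backslashes (no JSON escaping here).
search_body() {
  printf '{"query":{"multi_match":{"query":"%s","fields":["title^2","content"],"analyzer":"my_analyzer"}}}\n' "$1"
}

# Run the query and keep only score and title of each hit
# (requires the container to be running and the index to be populated).
search_books() {
  search_body "$1" | curl -s -X GET \
    "localhost:9200/book-ngram/_search?pretty&filter_path=hits.hits._score,hits.hits._source.title" \
    -H 'Content-Type: application/json' --data-binary @-
}

# Usage:
#   search_books "開天闢地"
```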

馬交野