Quick Chinese n-gram keyword full-text search with Elasticsearch
Sometimes a plain Ctrl-F text match over a collection of articles fails to turn up an exact hit. In that case it is worth quickly setting up your own Chinese search engine to see whether the material becomes easier to find. The free Elasticsearch is enough for this. Let's see how to spin up a lab quickly.

Download and run Elasticsearch with Docker.

    # Single-node lab instance, reachable only from localhost, with security disabled
    docker run -p 127.0.0.1:9200:9200 -d --name elasticsearch \
      -e "discovery.type=single-node" \
      -e "xpack.security.enabled=false" \
      -e "xpack.license.self_generated.type=basic" \
      -v "elasticsearch-data:/usr/share/elasticsearch/data" \
      docker.elastic.co/elasticsearch/elasticsearch:8.17.0

Create the database, which Elasticsearch calls an index, and define your own n-gram analyzer and tokenizer. With min_gram 1 and max_gram 5 the tokenizer emits every substring of one to five characters, so index.max_ngram_diff has to be raised to at least max_gram - min_gram, which is 4 here.

    # Create the index "book-ngram" with a custom n-gram analyzer
    curl -X PUT "localhost:9200/book-ngram?pretty" -H 'Content-Type: application/json' -d'
    {
      "settings": {
        "index": {
          "max_ngram_diff": 4
        },
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "ngram",
              "min_gram": 1,
              "max_gram": 5,
              "token_chars": [ "letter", "digit" ]
            }
          }
        }
      }
    }
    '

Assume each record has three fields: record_id, title and content, where title and content hold Chinese text. Both title and content use the n-gram analyzer.

    # Map title and content to the n-gram analyzer, each with a keyword sub-field
    curl -X PUT "localhost:9200/book-ngram/_mapping?pretty" -H 'Content-Type: application/json' -d'
    {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer",
          "fields": { "keyword": { "type": "keyword" } }
        },
        "content": {
          "type": "text",
          "analyzer": "my_analyzer",
          "fields": { "keyword": { "type": "keyword" } }
        },
        "record_id": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword" } }
        }
      }
    }
    '

Bulk-upload the content. (To upload a JSON file instead, replace -d'xxx' with --data-binary @FILENAME. Unlike -d @FILENAME, --data-binary keeps the newlines that the bulk format depends on.)

    # Each record is an action line followed by a document line
    curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
    { "index" : { "_index" : "book-ngram" } }
    {"record_id":"1","title":"紅樓夢","content":"甄士隱夢幻識通靈賈雨村風塵懷閨秀"}
    { "index" : { "_index" : "book-ngram" } }
    {"record_id":"2","title":"西遊記","content":"混沌未分天地亂,茫茫渺渺無人見。自從盤古破鴻蒙,開闢從茲清濁辨。覆載群生仰至仁,發明萬物皆成善。"}
    { "index" : { "_index" : "book-ngram" } }
    {"record_id":"3","title":"水滸傳","content":"張天師祈禳瘟疫洪太尉誤走妖魔"}
    '

Search across multiple fields, giving title twice the weight of content.

    # "title^2" boosts matches in title to twice the weight of content
    curl -X GET "localhost:9200/book-ngram/_search?pretty" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "multi_match": {
          "query": "開天闢地",
          "fields": ["title^2", "content"],
          "analyzer": "my_analyzer"
        }
      }
    }
    '