Elasticsearch - Text Analysis (5)

류큐큐 2024. 4. 16. 16:29

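Now it is time to do the actual groundwork for search.

Recall the inverted index mentioned in the Search Accuracy (3) post. Elasticsearch splits text into terms and records in the inverted index where each term appears.

When text is analyzed while this inverted index is being built,

an Analyzer is used, and as covered before it is made up of a Character Filter, a Tokenizer, and a Token Filter.

To recap briefly:
Character Filter: applies some preprocessing to the raw incoming text.
Tokenizer: splits that text into terms.
Token Filter: processes the split terms before they are stored in the inverted index.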
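
The three stages above can be sketched as a toy pipeline in plain Python. This is only an illustration of the order in which the stages run, not how Elasticsearch implements them; the function names and the specific rules (replacing "&", splitting on whitespace, lowercasing) are assumptions chosen for the example.

```python
def char_filter(text: str) -> str:
    # Character filter stage: rewrite the raw text (here: "&" becomes " and ").
    return text.replace("&", " and ")


def tokenizer(text: str) -> list[str]:
    # Tokenizer stage: split the filtered text into terms on whitespace.
    return text.split()


def token_filter(tokens: list[str]) -> list[str]:
    # Token filter stage: normalize each term (here: lowercase it).
    return [token.lower() for token in tokens]


def analyze(text: str) -> list[str]:
    # An analyzer runs the three stages in this fixed order.
    return token_filter(tokenizer(char_filter(text)))


print(analyze("Maison&Kitsune"))
```

The terms returned by `analyze` are what would end up in the inverted index in this toy model.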

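Let's see how Elasticsearch analyzes a piece of text and splits it into terms when nothing has been configured, so that the default standard analyzer is applied.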

POST http://localhost:9200/search_index/_analyze

{
  "text": "메종 키츠네"
}

{
    "tokens": [
        {
            "token": "메종",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<HANGUL>",
            "position": 0
        },
        {
            "token": "키츠네",
            "start_offset": 3,
            "end_offset": 6,
            "type": "<HANGUL>",
            "position": 1
        }
    ]
}

Entering 메종 키츠네 splits it into two terms, 메종 and 키츠네.
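
To make the offsets and positions in that response easier to read, here is a small sketch that mimics the result for this particular input by splitting on whitespace. The real standard tokenizer follows Unicode text segmentation rules rather than a plain whitespace split, so `whitespace_tokens` is a hypothetical stand-in that only happens to agree for simple inputs like this one.

```python
import re


def whitespace_tokens(text: str) -> list[dict]:
    # Each run of non-whitespace characters becomes one token.
    # start_offset / end_offset are character offsets into the input,
    # and position is the token's ordinal within the text.
    return [
        {
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        }
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


for token in whitespace_tokens("메종 키츠네"):
    print(token)
```

The space at offset 2 is why the second token starts at 3 rather than 2.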

So what happens if we enter 메종키츠네 without the space?

POST http://localhost:9200/search_index/_analyze

{
  "text": "메종키츠네"
}

{
    "tokens": [
        {
            "token": "메종키츠네",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<HANGUL>",
            "position": 0
        }
    ]
}

It is analyzed as a single term, 메종키츠네.

As a result, searching for 메종키츠네 and searching for 메종 키츠네 will return different results.
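
Why the results differ can be seen with a toy inverted index. The sketch below assumes whitespace tokenization and OR matching between query terms (the default behavior of a match query); `build_index` and `search` are hypothetical helpers, not Elasticsearch APIs.

```python
from collections import defaultdict

# Two of the documents used later in this post, keyed by _id.
docs = {3: "메종 키츠네", 4: "메종키츠네"}


def build_index(documents: dict[int, str]) -> dict[str, set[int]]:
    # Inverted index: term -> set of document ids containing that term.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    # A document matches if it contains at least one of the query terms.
    hits = set()
    for term in query.split():
        hits |= index.get(term, set())
    return hits


index = build_index(docs)
print(search(index, "메종 키츠네"))
print(search(index, "메종키츠네"))
```

Because the spaced and unspaced forms produce different terms, each query only finds the document that was written the same way.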

For this reason, it is common to build a custom analyzer that shapes the text into a form that is easier to search.

Let's walk through a few scenarios.

POST http://localhost:9200/search_index/_bulk

{"index":{"_id":1}}
{"message":"Maison Kitsune"}
{"index":{"_id":2}}
{"message":"MaisonKitsune"}
{"index":{"_id":3}}
{"message":"메종 키츠네"}
{"index":{"_id":4}}
{"message":"메종키츠네"}
{"index":{"_id":5}}
{"message":"메종 마르지엘라"}
{"index":{"_id":6}}
{"message":"메종마르지엘라"}
{"index":{"_id":7}}
{"message":"Maison Margiela"}
{"index":{"_id":8}}
{"message":"MaisonMargiela"}
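
If you would rather generate this bulk body than type it by hand, a minimal sketch follows. It only builds the newline-delimited JSON string; sending it (with the `Content-Type: application/x-ndjson` header) is left out. `build_bulk_body` is a hypothetical helper name.

```python
import json


def build_bulk_body(messages: list[str]) -> str:
    # The _bulk API expects one JSON object per line: an action line
    # followed by the document source line.
    lines = []
    for doc_id, message in enumerate(messages, start=1):
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps({"message": message}, ensure_ascii=False))
    # The body must end with a newline character.
    return "\n".join(lines) + "\n"


body = build_bulk_body(["Maison Kitsune", "MaisonKitsune", "메종 키츠네", "메종키츠네"])
print(body)
```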

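Suppose we index similar brand names such as 메종 키츠네 and 메종 마르지엘라 into Elasticsearch as above. If you search for these same strings on Google,

the strings in _id 1 through 4 all return the same results, and the strings in _id 5 through 8 all return the same results.

Also, searching for just the word 메종 brings up results for 메종 키츠네 and 메종 마르지엘라, not at the very top but somewhere around the middle.

Thinking this through,

it feels like once the Tokenizer has split the text into terms, the Token Filter needs to do some processing before they go into the inverted index.

There are many kinds of Token Filter: lowercase and uppercase convert all letters to lower or upper case,
stop removes words that are not needed for search, synonym supplies synonyms, ngram and edge_ngram index terms as smaller fragments, and shingle indexes runs of adjacent terms.

For now, let's use lowercase and synonym to convert terms to lowercase and provide synonyms.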
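
To get a feel for what these filters do to a list of terms, here are simplified Python stand-ins for lowercase, stop, edge_ngram, and shingle. They are sketches under assumptions: the parameter names echo the Elasticsearch settings, but the default values here are picked for the example, and the real shingle filter can also emit the original single terms alongside the shingles.

```python
def lowercase(tokens: list[str]) -> list[str]:
    # lowercase: normalize every term to lower case.
    return [token.lower() for token in tokens]


def stop(tokens: list[str], stopwords: set[str]) -> list[str]:
    # stop: drop terms that appear in the stopword set.
    return [token for token in tokens if token not in stopwords]


def edge_ngram(term: str, min_gram: int = 1, max_gram: int = 3) -> list[str]:
    # edge_ngram: prefixes of the term, from min_gram to max_gram characters.
    longest = min(max_gram, len(term))
    return [term[:size] for size in range(min_gram, longest + 1)]


def shingle(tokens: list[str], size: int = 2) -> list[str]:
    # shingle: join each run of `size` adjacent terms into one term.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


print(lowercase(["Maison", "KITSUNE"]))
print(stop(["the", "maison"], {"the"}))
print(edge_ngram("maison"))
print(shingle(["maison", "kitsune", "paris"]))
```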

First, delete the existing search_index index, then recreate it with the settings below.

DELETE http://localhost:9200/search_index

PUT http://localhost:9200/search_index

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_nori_tokenizer": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed",
          "user_dictionary_rules": [
            "메종",
            "키츠네",
            "메종 키츠네",
            "메종 마르지엘라",
            "마르지엘라"
          ]
        }
      },
      "filter": {
        "my_syn": {
          "type": "synonym",
          "synonyms": [
            "maison kitsune, maisonkitsune, 메종 키츠네, 메종키츠네",
            "메종 마르지엘라, 메종마르지엘라, 마르지엘라, mm6, maisonmargiela, maison margiela"
          ]
        },
        "my_pos_f": {
            "type": "nori_part_of_speech",
            "stoptags": [
               "E", "IC", "J", "MAG", "MAJ",
                "MM", "SP", "SSC", "SSO", "SC",
                "SE", "XPN", "XSA", "XSN", "XSV",
                "UNA", "NA", "VSV"
            ]
          }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "my_nori_tokenizer",
          "filter": [
            "lowercase",
            "my_pos_f",
            "my_syn"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

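For the tokenizer I used nori, a Korean morphological analyzer (provided by the analysis-nori plugin).

For the Token Filters I used lowercase together with two filters I defined, my_pos_f and my_syn.
my_syn lists the words I want treated as synonyms, as shown above.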
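
The comma-separated rule format used in my_syn can be illustrated with a small sketch. It treats every phrase in a rule as equivalent to every other phrase in that rule. Note the simplification: this toy looks up the whole input as one phrase, whereas the real synonym filter matches token sequences produced by the tokenizer, so the exact terms it emits will differ. `parse_rules` and `expand` are hypothetical names.

```python
SYNONYM_RULES = [
    "maison kitsune, maisonkitsune, 메종 키츠네, 메종키츠네",
]


def parse_rules(rules: list[str]) -> dict[str, list[str]]:
    # Map each phrase to the full group of phrases listed in its rule.
    table = {}
    for rule in rules:
        group = [phrase.strip() for phrase in rule.split(",")]
        for phrase in group:
            table[phrase] = group
    return table


def expand(text: str, table: dict[str, list[str]]) -> list[str]:
    # Lowercase and collapse whitespace, then emit the terms of every
    # equivalent phrase (or just the input's own terms if no rule matches).
    phrase = " ".join(text.lower().split())
    terms = []
    for variant in table.get(phrase, [phrase]):
        for term in variant.split():
            if term not in terms:
                terms.append(term)
    return terms


table = parse_rules(SYNONYM_RULES)
print(expand("Maison Kitsune", table))
print(expand("메종키츠네", table))
```

Whichever spelling comes in, the same set of terms comes out, which is what lets the different spellings match each other.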

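For reference, my_pos_f is used to remove parts of speech in Korean text that are not needed for search.
If you are curious about the details, see the guide below.
https://esbook.kimjmin.net/06-text-analysis/6.7-stemming/6.7.2-nori
I simply used the default stoptags exactly as they are.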
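
Conceptually, a part-of-speech filter drops any token whose tag is in the stoptags list. The sketch below uses the same stoptags as the settings above, but the tagged input is hand-written for illustration (in practice nori assigns the tags), and `pos_filter` is a hypothetical helper.

```python
# Same tags as the stoptags list in my_pos_f.
STOPTAGS = {
    "E", "IC", "J", "MAG", "MAJ",
    "MM", "SP", "SSC", "SSO", "SC",
    "SE", "XPN", "XSA", "XSN", "XSV",
    "UNA", "NA", "VSV",
}


def pos_filter(tagged_tokens: list[tuple[str, str]], stoptags: set[str] = STOPTAGS) -> list[str]:
    # Keep only the tokens whose part-of-speech tag is not a stop tag.
    return [token for token, tag in tagged_tokens if tag not in stoptags]


# Hand-assigned tags for illustration: NNG (general noun), J (particle).
tagged = [("메종", "NNG"), ("키츠네", "NNG"), ("의", "J")]
print(pos_filter(tagged))
```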

Now let's see how the my_custom_analyzer I built analyzes the word 메종키츠네.
With the new analyzer applied,

noticeably more terms are produced than before.

POST http://localhost:9200/search_index/_analyze
{
  "text": "메종키츠네",
  "analyzer": "my_custom_analyzer"
}

response

{
    "tokens": [
        {
            "token": "메종",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "maison",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "maisonkitsune",
            "start_offset": 0,
            "end_offset": 5,
            "type": "SYNONYM",
            "position": 0,
            "positionLength": 2
        },
        {
            "token": "메종",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "키츠네",
            "start_offset": 2,
            "end_offset": 5,
            "type": "word",
            "position": 1
        },
        {
            "token": "kitsune",
            "start_offset": 2,
            "end_offset": 5,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "키츠네",
            "start_offset": 2,
            "end_offset": 5,
            "type": "SYNONYM",
            "position": 1
        }
    ]
}

POST http://localhost:9200/search_index/_analyze
{
  "text": "메종 키츠네",
  "analyzer": "my_custom_analyzer"
}

response
{
    "tokens": [
        {
            "token": "메종",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "maison",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "maisonkitsune",
            "start_offset": 0,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 0,
            "positionLength": 2
        },
        {
            "token": "메종",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "키츠네",
            "start_offset": 3,
            "end_offset": 6,
            "type": "word",
            "position": 1
        },
        {
            "token": "kitsune",
            "start_offset": 3,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "키츠네",
            "start_offset": 3,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 1
        }
    ]
}
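
One quick way to compare the two responses above is to look only at the emitted token strings. The lists below are copied from those responses; the comparison itself is just a sketch using `collections.Counter`.

```python
from collections import Counter

# Token strings from the response for "메종키츠네" (no space).
tokens_no_space = ["메종", "maison", "maisonkitsune", "메종", "키츠네", "kitsune", "키츠네"]

# Token strings from the response for "메종 키츠네" (with a space).
tokens_with_space = ["메종", "maison", "maisonkitsune", "메종", "키츠네", "kitsune", "키츠네"]


def term_counts(tokens: list[str]) -> Counter:
    # Count how many times each term was emitted by the analyzer.
    return Counter(tokens)


same_terms = term_counts(tokens_no_space) == term_counts(tokens_with_space)
print(same_terms)
print(sorted(term_counts(tokens_no_space).items()))
```

Only the offsets differ between the two responses, because of the extra space in the input; the terms themselves are identical.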

Now let's search for 메종키츠네 and 메종 키츠네 again.

POST http://localhost:9200/search_index/_search

{  
    "query": {
        "bool": {
        "must": [
            {
            "match": {
                "message": "메종키츠네"
                }
            }
        ]
    }
  }
}

response
{
    "took": 11,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 12.004257,
        "hits": [
            {
                "_index": "set_of_search_index",
                "_id": "3",
                "_score": 12.004257,
                "_source": {
                    "message": "메종 키츠네"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "4",
                "_score": 12.004257,
                "_source": {
                    "message": "메종키츠네"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "1",
                "_score": 10.636996,
                "_source": {
                    "message": "Maison Kitsune"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "2",
                "_score": 10.636996,
                "_source": {
                    "message": "MaisonKitsune"
                }
            }
        ]
    }
}

POST http://localhost:9200/search_index/_search

{  
    "query": {
        "bool": {
        "must": [
            {
            "match": {
                "message": "메종 키츠네"
                }
            }
        ]
    }
  }
}

response

{
    "took": 20,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 12.004257,
        "hits": [
            {
                "_index": "set_of_search_index",
                "_id": "3",
                "_score": 12.004257,
                "_source": {
                    "message": "메종 키츠네"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "4",
                "_score": 12.004257,
                "_source": {
                    "message": "메종키츠네"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "1",
                "_score": 10.636996,
                "_source": {
                    "message": "Maison Kitsune"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "2",
                "_score": 10.636996,
                "_source": {
                    "message": "MaisonKitsune"
                }
            }
        ]
    }
}

Now, whether we search for "메종키츠네" or "메종 키츠네", the documents containing the words we registered as synonyms are returned as well.
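
When comparing responses like these, it can help to reduce each one to its ranked list of ids and scores. The sketch below works on a trimmed copy of the response shown above (only the fields it needs); `ranked_hits` is a hypothetical helper.

```python
# Trimmed copy of the search response above: only _id and _score are kept.
response = {
    "hits": {
        "hits": [
            {"_id": "3", "_score": 12.004257},
            {"_id": "4", "_score": 12.004257},
            {"_id": "1", "_score": 10.636996},
            {"_id": "2", "_score": 10.636996},
        ]
    }
}


def ranked_hits(search_response: dict) -> list[tuple[str, float]]:
    # Return (_id, _score) pairs in the order Elasticsearch ranked them.
    return [(hit["_id"], hit["_score"]) for hit in search_response["hits"]["hits"]]


for doc_id, score in ranked_hits(response):
    print(doc_id, score)
```

Both queries in this post produce the same ranked list, with the Korean spellings scoring slightly higher than the English ones.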

Checking how the terms were actually indexed for a document shows that everything was indexed as expected, as below.

GET http://localhost:9200/search_index/_termvectors/4?fields=message

response

{
    "_index": "set_of_search_index",
    "_id": "4",
    "_version": 1,
    "found": true,
    "took": 16,
    "term_vectors": {
        "message": {
            "field_statistics": {
                "sum_doc_freq": 48,
                "doc_count": 8,
                "sum_ttf": 60
            },
            "terms": {
                "kitsune": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 1,
                            "start_offset": 2,
                            "end_offset": 5
                        }
                    ]
                },
                "maison": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 2
                        }
                    ]
                },
                "maisonkitsune": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 5
                        }
                    ]
                },
                "메종": {
                    "term_freq": 2,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 2
                        },
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 2
                        }
                    ]
                },
                "키츠네": {
                    "term_freq": 2,
                    "tokens": [
                        {
                            "position": 1,
                            "start_offset": 2,
                            "end_offset": 5
                        },
                        {
                            "position": 1,
                            "start_offset": 2,
                            "end_offset": 5
                        }
                    ]
                }
            }
        }
    }
}
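
The term vectors response is verbose, so a short sketch for pulling out just the term frequencies may be handy. The dictionary below is a trimmed copy of the response above with the per-token offsets removed; `term_frequencies` is a hypothetical helper.

```python
# Trimmed copy of the _termvectors response above.
response = {
    "term_vectors": {
        "message": {
            "terms": {
                "kitsune": {"term_freq": 1},
                "maison": {"term_freq": 1},
                "maisonkitsune": {"term_freq": 1},
                "메종": {"term_freq": 2},
                "키츠네": {"term_freq": 2},
            }
        }
    }
}


def term_frequencies(termvectors_response: dict, field: str) -> dict[str, int]:
    # Map each indexed term of the given field to its frequency in the document.
    terms = termvectors_response["term_vectors"][field]["terms"]
    return {term: info["term_freq"] for term, info in terms.items()}


print(term_frequencies(response, "message"))
```

메종 and 키츠네 have a frequency of 2 because each was emitted once as the original word and once more as a synonym.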

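Once documents have been indexed through this analysis process, the preparation for search is complete.

In practice there will be far more words and synonyms to handle, but I hope this post helps when you implement a search feature.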