Elastic Search - Text Analysis (5)


류큐큐 2024. 4. 16. 16:29

Now it is time to do the actual groundwork for search.

Recall the inverted index mentioned in the Search Accuracy (3) post. Elasticsearch splits text into terms and records, in an inverted index, where each term occurs.

When text is analyzed to build this inverted index, Elasticsearch uses an Analyzer, which, as mentioned before, consists of Character Filters, a Tokenizer, and Token Filters.

To recap briefly:
Character Filter: preprocesses the raw incoming text before it is split.
Tokenizer: splits that text into terms.
Token Filter: post-processes the split terms; the resulting terms are what get stored in the inverted index.
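The three stages can be pictured with a small toy pipeline. This is only a sketch in plain Python, not how Elasticsearch is implemented: the tag-stripping character filter, the whitespace tokenizer, and all function names are assumptions made up for illustration.

```python
import re
from collections import defaultdict

def char_filter(text):
    # Hypothetical character filter: strip HTML-like tags from the raw text.
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # Toy tokenizer: split on whitespace only.
    return text.split()

def token_filter(tokens):
    # Toy token filter: lowercase every term.
    return [t.lower() for t in tokens]

def analyze(text):
    # Character filter -> tokenizer -> token filter, in that order.
    return token_filter(tokenizer(char_filter(text)))

def build_inverted_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)

docs = {1: "<b>Maison</b> Kitsune", 2: "Maison Margiela"}
index = build_inverted_index(docs)
print(index["maison"])   # {1, 2}
print(index["kitsune"])  # {1}
```

The point of the sketch is the ordering: whatever the token filter emits is what ends up as a key in the inverted index.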

Let's see how Elasticsearch, with no analyzer configured, analyzes a piece of text and splits it into terms.

POST http://localhost:9200/search_index/_analyze

{
  "text": "메종 키츠네"
}

{
    "tokens": [
        {
            "token": "메종",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<HANGUL>",
            "position": 0
        },
        {
            "token": "키츠네",
            "start_offset": 3,
            "end_offset": 6,
            "type": "<HANGUL>",
            "position": 1
        }
    ]
}

Given the text 메종 키츠네, the result is two terms: 메종 and 키츠네.

So what happens if we enter 메종키츠네 without the space?

POST http://localhost:9200/search_index/_analyze

{
  "text": "메종키츠네"
}

{
    "tokens": [
        {
            "token": "메종키츠네",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<HANGUL>",
            "position": 0
        }
    ]
}

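It is analyzed as a single term, 메종키츠네.

As a result, searching for 메종키츠네 and searching for 메종 키츠네 will return different results.

For this reason, it is common to build a custom analyzer that shapes terms so they are easier to search.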
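The mismatch can be reproduced with a rough stand-in for the default analysis. The sketch below assumes that, for these inputs, the default behaviour amounts to splitting on whitespace and lowercasing; the helper names are hypothetical.

```python
def default_like_terms(text):
    # Rough stand-in for the default analysis of these inputs:
    # split on whitespace and lowercase.
    return [t.lower() for t in text.split()]

def shares_term(query_terms, doc_terms):
    # A match needs at least one term in common.
    return bool(set(query_terms) & set(doc_terms))

indexed = default_like_terms("메종 키츠네")       # ['메종', '키츠네']
query_spaced = default_like_terms("메종 키츠네")  # ['메종', '키츠네']
query_joined = default_like_terms("메종키츠네")   # ['메종키츠네']

print(shares_term(query_spaced, indexed))  # True
print(shares_term(query_joined, indexed))  # False
```

The unspaced query produces one long term that never appears in the index, which is exactly the gap a custom analyzer is meant to close.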

Let's consider a few scenarios.

POST http://localhost:9200/search_index/_bulk

{"index":{"_id":1}}
{"message":"Maison Kitsune"}
{"index":{"_id":2}}
{"message":"MaisonKitsune"}
{"index":{"_id":3}}
{"message":"메종 키츠네"}
{"index":{"_id":4}}
{"message":"메종키츠네"}
{"index":{"_id":5}}
{"message":"메종 마르지엘라"}
{"index":{"_id":6}}
{"message":"메종마르지엘라"}
{"index":{"_id":7}}
{"message":"Maison Margiela"}
{"index":{"_id":8}}
{"message":"MaisonMargiela"}

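Suppose similar brand names such as 메종 키츠네 (Maison Kitsune) and 메종 마르지엘라 (Maison Margiela) are indexed in Elasticsearch as above. If you search for these same strings on Google,

_id 1 through 4 all return the same results, and so do _id 5 through 8.

Searching for just 메종 also brings up results for 메종 키츠네 and 메종 마르지엘라, not at the very top but somewhere around the middle.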

This suggests that, after the Tokenizer splits the text into terms, the Token Filters need to transform those terms in some way before they go into the inverted index.

There are many Token Filters: lowercase and uppercase convert every letter to lower or upper case,
stop removes words that are not needed for search, synonym supplies synonyms, and ngram, edge_ngram, and shingle index text as smaller fragments or word sequences.

For now, let's use lowercase and synonym to convert terms to lowercase and supply synonyms.
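To make the idea concrete, here is a toy version of the two filters in plain Python. It is a simplified sketch: it only handles single-token synonyms, the groups are a reduced, hypothetical version of the ones used later in this post, and Elasticsearch's real synonym filter also handles multi-word entries.

```python
# Hypothetical single-token synonym groups (lowercase on purpose).
SYNONYM_GROUPS = [
    ["maisonkitsune", "메종키츠네"],
    ["maisonmargiela", "메종마르지엘라", "mm6"],
]

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def synonym_filter(tokens):
    # Build a lookup from each word to its whole group.
    lookup = {}
    for group in SYNONYM_GROUPS:
        for word in group:
            lookup[word] = group
    expanded = []
    for token in tokens:
        # Keep the original token first, then add its synonyms.
        expanded.append(token)
        for syn in lookup.get(token, []):
            if syn != token:
                expanded.append(syn)
    return expanded

tokens = synonym_filter(lowercase_filter(["MaisonKitsune"]))
print(tokens)  # ['maisonkitsune', '메종키츠네']
```

Note the order: lowercasing runs first so that "MaisonKitsune" can match the lowercase synonym entry. Applied the other way around, the lookup would miss.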

First, delete the existing search_index index, then recreate it with the settings below.

DELETE http://localhost:9200/search_index

PUT http://localhost:9200/search_index

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_nori_tokenizer": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed",
          "user_dictionary_rules": [
            "메종",
            "키츠네",
            "메종 키츠네",
            "메종 마르지엘라",
            "마르지엘라"
          ]
        }
      },
      "filter": {
        "my_syn": {
          "type": "synonym",
          "synonyms": [
            "maison kitsune, maisonkitsune, 메종 키츠네, 메종키츠네",
            "메종 마르지엘라, 메종마르지엘라, 마르지엘라, mm6, maisonmargiela, maison margiela"
          ]
        },
        "my_pos_f": {
          "type": "nori_part_of_speech",
          "stoptags": [
            "E", "IC", "J", "MAG", "MAJ",
            "MM", "SP", "SSC", "SSO", "SC",
            "SE", "XPN", "XSA", "XSN", "XSV",
            "UNA", "NA", "VSV"
          ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "my_nori_tokenizer",
          "filter": [
            "lowercase",
            "my_pos_f",
            "my_syn"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

For the tokenizer I used nori, a Korean morphological analyzer.

The Token Filters are lowercase, my_pos_f, and my_syn;
my_syn lists the words I want treated as synonyms, as shown above.

For reference, my_pos_f removes parts of speech in Korean text that are not needed for search.
If you are curious, see the reference below.
https://esbook.kimjmin.net/06-text-analysis/6.7-stemming/6.7.2-nori
I simply used the default stoptags as they are.
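What a part-of-speech filter does can be sketched as follows. The tokens here are tagged by hand for illustration (nori assigns the tags itself during tokenization), and the function name is hypothetical; only the stoptags list is taken from the settings above.

```python
# Stoptags copied from the my_pos_f filter in the index settings.
STOPTAGS = {
    "E", "IC", "J", "MAG", "MAJ",
    "MM", "SP", "SSC", "SSO", "SC",
    "SE", "XPN", "XSA", "XSN", "XSV",
    "UNA", "NA", "VSV",
}

def pos_filter(tagged_tokens):
    # Drop every token whose part-of-speech tag is in the stop list.
    return [surface for surface, tag in tagged_tokens if tag not in STOPTAGS]

# Hand-tagged example for "가방을 사다" (an assumption, not real nori output):
# noun + particle (J) + verb stem + ending (E).
tagged = [("가방", "NNG"), ("을", "J"), ("사", "VV"), ("다", "E")]
print(pos_filter(tagged))  # ['가방', '사']
```

Particles and endings carry little search value, so removing them keeps the inverted index focused on content words.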

Now let's see how my_custom_analyzer analyzes 메종키츠네.
With the new analyzer applied, the output clearly contains more terms than before.

POST http://localhost:9200/search_index/_analyze
{
  "text": "메종키츠네",
  "analyzer": "my_custom_analyzer"
}

response

{
    "tokens": [
        {
            "token": "메종",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "maison",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "maisonkitsune",
            "start_offset": 0,
            "end_offset": 5,
            "type": "SYNONYM",
            "position": 0,
            "positionLength": 2
        },
        {
            "token": "메종",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "키츠네",
            "start_offset": 2,
            "end_offset": 5,
            "type": "word",
            "position": 1
        },
        {
            "token": "kitsune",
            "start_offset": 2,
            "end_offset": 5,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "키츠네",
            "start_offset": 2,
            "end_offset": 5,
            "type": "SYNONYM",
            "position": 1
        }
    ]
}

POST http://localhost:9200/search_index/_analyze
{
  "text": "메종 키츠네",
  "analyzer": "my_custom_analyzer"
}

response
{
    "tokens": [
        {
            "token": "메종",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "maison",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "maisonkitsune",
            "start_offset": 0,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 0,
            "positionLength": 2
        },
        {
            "token": "메종",
            "start_offset": 0,
            "end_offset": 2,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "키츠네",
            "start_offset": 3,
            "end_offset": 6,
            "type": "word",
            "position": 1
        },
        {
            "token": "kitsune",
            "start_offset": 3,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "키츠네",
            "start_offset": 3,
            "end_offset": 6,
            "type": "SYNONYM",
            "position": 1
        }
    ]
}
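One way to read these responses is to group the tokens by position, since synonyms are stacked on the same position as the original word. The helper below is a small sketch with a hypothetical name; the sample dictionary is a trimmed copy of the response above.

```python
from collections import defaultdict

def tokens_by_position(analyze_response):
    # Collect the distinct tokens emitted at each position.
    grouped = defaultdict(list)
    for tok in analyze_response["tokens"]:
        if tok["token"] not in grouped[tok["position"]]:
            grouped[tok["position"]].append(tok["token"])
    return dict(grouped)

# Trimmed copy of the _analyze response shown above.
sample = {"tokens": [
    {"token": "메종", "type": "word", "position": 0},
    {"token": "maison", "type": "SYNONYM", "position": 0},
    {"token": "maisonkitsune", "type": "SYNONYM", "position": 0, "positionLength": 2},
    {"token": "메종", "type": "SYNONYM", "position": 0},
    {"token": "키츠네", "type": "word", "position": 1},
    {"token": "kitsune", "type": "SYNONYM", "position": 1},
    {"token": "키츠네", "type": "SYNONYM", "position": 1},
]}

print(tokens_by_position(sample))
# {0: ['메종', 'maison', 'maisonkitsune'], 1: ['키츠네', 'kitsune']}
```

Both the spaced and unspaced inputs collapse to the same two positions, which is why they behave identically at search time.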

Now let's search for 메종키츠네 and 메종 키츠네 again.

POST http://localhost:9200/search_index/_search

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "message": "메종키츠네"
          }
        }
      ]
    }
  }
}

response
{
    "took": 11,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 12.004257,
        "hits": [
            {
                "_index": "set_of_search_index",
                "_id": "3",
                "_score": 12.004257,
                "_source": {
                    "message": "메종 키츠네"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "4",
                "_score": 12.004257,
                "_source": {
                    "message": "메종키츠네"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "1",
                "_score": 10.636996,
                "_source": {
                    "message": "Maison Kitsune"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "2",
                "_score": 10.636996,
                "_source": {
                    "message": "MaisonKitsune"
                }
            }
        ]
    }
}

POST http://localhost:9200/search_index/_search

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "message": "메종 키츠네"
          }
        }
      ]
    }
  }
}

response

{
    "took": 20,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 12.004257,
        "hits": [
            {
                "_index": "set_of_search_index",
                "_id": "3",
                "_score": 12.004257,
                "_source": {
                    "message": "메종 키츠네"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "4",
                "_score": 12.004257,
                "_source": {
                    "message": "메종키츠네"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "1",
                "_score": 10.636996,
                "_source": {
                    "message": "Maison Kitsune"
                }
            },
            {
                "_index": "set_of_search_index",
                "_id": "2",
                "_score": 10.636996,
                "_source": {
                    "message": "MaisonKitsune"
                }
            }
        ]
    }
}

Now, searching for either "메종키츠네" or "메종 키츠네" also returns the documents that contain the words registered as synonyms earlier.
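If you would rather run the two searches from a script than from an HTTP client, a minimal sketch using only the Python standard library could look like this. The helper names are hypothetical, and the request is only built here, not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # host and port as used in this post

def build_match_request(index, field, text):
    # Same bool/must/match body as the requests above.
    body = {"query": {"bool": {"must": [{"match": {field: text}}]}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def hit_ids(search_response):
    # Pull the document ids out of a _search response.
    return [hit["_id"] for hit in search_response["hits"]["hits"]]

req = build_match_request("search_index", "message", "메종키츠네")
print(req.get_method(), req.full_url)

# To actually send it (requires a running Elasticsearch):
# with urllib.request.urlopen(req) as resp:
#     print(hit_ids(json.load(resp)))
```

Comparing `hit_ids` for both query strings is a quick way to confirm that the spaced and unspaced forms return the same documents.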

Checking how the document was actually indexed shows that the inverted index contains the expected terms, as below.

GET http://localhost:9200/search_index/_termvectors/4?fields=message

response

{
    "_index": "set_of_search_index",
    "_id": "4",
    "_version": 1,
    "found": true,
    "took": 16,
    "term_vectors": {
        "message": {
            "field_statistics": {
                "sum_doc_freq": 48,
                "doc_count": 8,
                "sum_ttf": 60
            },
            "terms": {
                "kitsune": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 1,
                            "start_offset": 2,
                            "end_offset": 5
                        }
                    ]
                },
                "maison": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 2
                        }
                    ]
                },
                "maisonkitsune": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 5
                        }
                    ]
                },
                "메종": {
                    "term_freq": 2,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 2
                        },
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 2
                        }
                    ]
                },
                "키츠네": {
                    "term_freq": 2,
                    "tokens": [
                        {
                            "position": 1,
                            "start_offset": 2,
                            "end_offset": 5
                        },
                        {
                            "position": 1,
                            "start_offset": 2,
                            "end_offset": 5
                        }
                    ]
                }
            }
        }
    }
}
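The term vectors response is verbose, so a small helper that reduces it to term frequencies can make it easier to scan. This is a sketch with a hypothetical function name; the sample is a trimmed copy of the response above.

```python
def term_frequencies(termvectors_response, field):
    # Reduce a _termvectors response to {term: term_freq} for one field.
    terms = termvectors_response["term_vectors"][field]["terms"]
    return {term: info["term_freq"] for term, info in terms.items()}

# Trimmed copy of the _termvectors response shown above.
sample = {"term_vectors": {"message": {"terms": {
    "kitsune": {"term_freq": 1},
    "maison": {"term_freq": 1},
    "maisonkitsune": {"term_freq": 1},
    "메종": {"term_freq": 2},
    "키츠네": {"term_freq": 2},
}}}}

print(term_frequencies(sample, "message"))
# {'kitsune': 1, 'maison': 1, 'maisonkitsune': 1, '메종': 2, '키츠네': 2}
```

The frequency of 2 for 메종 and 키츠네 reflects that each was emitted once as the original word and once more as a synonym.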

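Once documents have been indexed through this analysis process, the preparation for search is complete.

Real-world systems will have far more words and synonyms, but I hope this post helps you implement search.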
