
개발잡부

[es] Let's verify the search results


닉의네임 2022. 1. 21. 11:29

 

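  • True Positive (TP): the actual answer is True and it is predicted as True (correct)
  • False Positive (FP): the actual answer is False but it is predicted as True (wrong)
  • False Negative (FN): the actual answer is True but it is predicted as False (wrong)
  • True Negative (TN): the actual answer is False and it is predicted as False (correct)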
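As a rough illustration of the four cases in a search setting, the sketch below reads "predicted True" as "the document was returned" and "actually True" as "the document is relevant". The document ids and the function name `confusion_counts` are made up for the example.

```python
def confusion_counts(retrieved, relevant, corpus):
    """Count TP/FP/FN/TN for one query, given collections of document ids."""
    retrieved, relevant, corpus = set(retrieved), set(relevant), set(corpus)
    tp = len(retrieved & relevant)            # relevant and returned
    fp = len(retrieved - relevant)            # returned but not relevant
    fn = len(relevant - retrieved)            # relevant but missed
    tn = len(corpus - retrieved - relevant)   # not relevant and not returned
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical ids: 6 documents in the index, 3 returned, 3 relevant.
counts = confusion_counts(
    retrieved={"d1", "d2", "d3"},
    relevant={"d1", "d3", "d5"},
    corpus={"d1", "d2", "d3", "d4", "d5", "d6"},
)
print(counts)
```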

 

Precision

The proportion of documents returned by the search that are actually relevant: TP / (TP + FP).

 

Recall

The proportion of relevant documents that the search actually returned: TP / (TP + FN).
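Both ratios can be computed from two sets of document ids. A minimal sketch, assuming relevance judgments are available as a plain list of ids (the ids and the helper name `precision_recall` are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query from two id collections."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents that are both returned and relevant
    # Guard against empty sets so the division never fails.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 documents judged relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print("precision: {:.2f}, recall: {:.2f}".format(p, r))
```

Returning more documents tends to raise recall and lower precision, which is why the two are usually read together.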

 

Performance evaluation algorithms

nDCG

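  • CG (Cumulative Gain) = sums the relevance of the recommended results, all with equal weight
  • DCG (Discounted CG) = discounts relevance by ranking position, so lower-ranked results count less
  • nDCG (normalized DCG) = divides DCG by the best possible (ideal) DCG over the whole data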
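The three quantities can be sketched in a few lines. This version assumes graded relevance labels listed in ranked order, uses the linear-gain form (relevance divided by log2(rank + 1)), and takes the ideal ordering from the same list of labels; the grades below are invented for the example.

```python
import math


def cg(relevances):
    """Cumulative gain: every position counts equally."""
    return sum(relevances)


def dcg(relevances):
    """Discounted cumulative gain: position i (1-based) is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """DCG of the top k results divided by the DCG of the best possible ordering."""
    relevances = list(relevances)
    ideal = sorted(relevances, reverse=True)
    if k is not None:
        relevances, ideal = relevances[:k], ideal[:k]
    best = dcg(ideal)
    return dcg(relevances) / best if best > 0 else 0.0


# Hypothetical relevance grades (0 = irrelevant, 3 = perfect) for a top-5 result list.
grades = [3, 2, 3, 0, 1]
print("CG: {}, DCG: {:.3f}, nDCG: {:.3f}".format(cg(grades), dcg(grades), ndcg(grades)))
```

Some formulations use 2^relevance - 1 as the gain instead of the raw grade, which rewards highly relevant documents more strongly.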

 

Add pandas to require.txt

elasticsearch
numpy
tensorflow
tensorflow-hub
tensorflow_text
kss
regex
flask
flask_restful
matplotlib
pandas

 

query_test.py

# -*- coding: utf-8 -*-

import time
import math

from elasticsearch import Elasticsearch

import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 (imported only for its side effect: registers ops the multilingual encoder needs)

import matplotlib.pyplot as plt
import numpy as np

##### SEARCHING #####

def handle_query():
    query = "나이키 남성 신발"  # "Nike men's shoes"
    embedding_start = time.time()
    query_vector = embed_text([query])[0]
    embedding_time = time.time() - embedding_start

    script_query = {
        "function_score": {
            "query": {
                "multi_match": {
                    "query": query,
                    "fields": [
                        "name",
                        "category^2"
                    ]
                }
            },
            "functions": [
                {
                    "script_score": {
                        "script": {
                            "source": "cosineSimilarity(params.query_vector, 'feature_vector') * doc['weight'].value * doc['populr'].value / doc['name'].length + doc['category'].length",
                            "params": {
                                "query_vector": query_vector
                            }
                        }
                    },
                    "weight": 0.1
                }
            ]
        }
    }

    search_start = time.time()
    response = client.search(
        index=INDEX_NAME,
        body={
            "size": SEARCH_SIZE,
            "query": script_query,
            "_source": {"includes": ["name", "category"]}
        }
    )
    search_time = time.time() - search_start

    print()
    print("{} total hits.".format(response["hits"]["total"]["value"]))
    print("embedding time: {:.2f} ms".format(embedding_time * 1000))
    print("search time: {:.2f} ms".format(search_time * 1000))


    for hit in response["hits"]["hits"]:
        print("id: {}, score: {}".format(hit["_id"], hit["_score"]))
        print(hit["_source"])
        print()

    # print(response["hits"]["max_score"])
    y = [hit["_score"] for hit in response["hits"]["hits"]]
    if not y:
        # Nothing to plot, and max_score is None when there are no hits.
        print("No hits to plot.")
        return
    # Ranks are 1-based, and there may be fewer hits than SEARCH_SIZE.
    x = np.arange(1, len(y) + 1)
    y_max = math.ceil(response["hits"]["max_score"])

    plt.xlim([1, SEARCH_SIZE])      # x-axis range: [xmin, xmax]
    plt.ylim([0, y_max])            # y-axis range: [ymin, ymax]
    plt.xlabel('top {}'.format(SEARCH_SIZE), labelpad=2)
    plt.ylabel('score', labelpad=2)
    plt.plot(x, y, label='query1', color='#e35f62', marker='*', linewidth=1)
    plt.legend()
    plt.title('Query score')
    plt.xticks(x)
    plt.yticks(np.arange(1, y_max))
    plt.grid(True)
    plt.show()

##### EMBEDDING #####

def embed_text(texts):
    # Encode a list of strings into a list of plain-Python float vectors.
    vectors = model(texts)
    return [vector.numpy().tolist() for vector in vectors]


##### MAIN SCRIPT #####

if __name__ == '__main__':
    INDEX_NAME = "products_r"

    SEARCH_SIZE = 10
    print("Downloading pre-trained embeddings from tensorflow hub...")
    model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")
    client = Elasticsearch(http_auth=('elastic', 'dlengus'))

    handle_query()

    print("Done.")

 

 

top10 Score

 

Hmm.. now that it's built, I see I've made yet another useless thing.. what's the point of comparing scores..
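If the goal is to judge ranking quality rather than compare raw scores, one option is Elasticsearch's ranking evaluation API (`_rank_eval`), which takes a query plus hand-labelled ratings and reports a metric such as normalized DCG. The sketch below only builds the request body. `build_rank_eval_body` is a hypothetical helper, the document ids and ratings are invented, and the commented-out call assumes a running cluster and the 7.x Python client.

```python
def build_rank_eval_body(index_name, query, ratings, k=10):
    """Build a _rank_eval request body for one query.

    ratings maps document id -> relevance grade (0 = irrelevant).
    """
    return {
        "requests": [
            {
                "id": "query_1",
                "request": {"query": query},
                "ratings": [
                    {"_index": index_name, "_id": doc_id, "rating": rating}
                    for doc_id, rating in ratings.items()
                ],
            }
        ],
        # "normalize": True asks for nDCG instead of plain DCG.
        "metric": {"dcg": {"k": k, "normalize": True}},
    }


body = build_rank_eval_body(
    "products_r",
    {"multi_match": {"query": "나이키 남성 신발", "fields": ["name", "category^2"]}},
    {"101": 3, "102": 0, "103": 1},  # hypothetical ids and grades
)

# With a running cluster (assumption, not executed here):
# result = client.rank_eval(index="products_r", body=body)
# print(result["metric_score"])
```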

 
