개발잡부


[tensorflow]text-embeddings

닉의네임 2021. 12. 22. 15:29

 

 

https://github.com/900gle/text-embeddings

: A simple example of using Elasticsearch for similarity search by combining a sentence embedding model with the dense_vector field type.

 

Development environment:

  • macOS
  • Python 3.7.9
  • TensorFlow 1.14

Check the version

pip3 -V

Install the dependencies

pip3 install -r requirements.txt

Looking at requirements.txt,

it requires es 7.0.2. Since I already have an Elasticsearch environment built, let's just start that container.

 

Run

python3 src/main.py

The posts index

{
   "version":5,
   "mapping_version":1,
   "settings_version":1,
   "aliases_version":1,
   "routing_num_shards":1024,
   "state":"open",
   "settings":{
      "index":{
         "routing":{
            "allocation":{
               "include":{
                  "_tier_preference":"data_content"
               }
            }
         },
         "number_of_shards":"2",
         "provided_name":"posts",
         "creation_date":"1640152158634",
         "number_of_replicas":"1",
         "uuid":"QJn79J_ZQ_CBHjXTUGrshw",
         "version":{
            "created":"7120199"
         }
      }
   },
   "mappings":{
      "_doc":{
         "dynamic":"true",
         "properties":{
            "answerId":{
               "type":"keyword"
            },
            "questionId":{
               "type":"keyword"
            },
            "acceptedAnswerId":{
               "type":"keyword"
            },
            "title_vector":{
               "dims":512,
               "type":"dense_vector"
            },
            "body":{
               "type":"text"
            },
            "creationDate":{
               "type":"date"
            },
            "title":{
               "type":"text"
            },
            "type":{
               "type":"keyword"
            },
            "user":{
               "type":"keyword"
            },
            "tags":{
               "type":"keyword"
            }
         }
      }
   },
   "aliases":[
      
   ],
   "primary_terms":{
      "0":1,
      "1":1
   },
   "in_sync_allocations":{
      "0":[
         "kXcl3xrXRueJjhv4cIWBBw"
      ],
      "1":[
         "bATrkQrNSJGAkNKUcGgGIw"
      ]
   },
   "rollover_info":{
      
   },
   "system":false,
   "timestamp_range":{
      "unknown":true
   }
}

 

 

title_vector is mapped as a 512-dimensional dense_vector.

How was the dimension chosen?
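The 512 comes from the embedding model, not from Elasticsearch: Universal Sentence Encoder v2, which main.py loads from TF Hub, emits 512-dimensional vectors, so the mapping's dims has to match. A quick pure-Python sanity check (the helper name is my own, not from the example repo):

```python
# Hypothetical helper: confirm a vector's length matches the dims
# declared for title_vector in the index mapping.
def check_dims(mapping, vector):
    declared = mapping["mappings"]["_doc"]["properties"]["title_vector"]["dims"]
    return len(vector) == declared

mapping = {
    "mappings": {
        "_doc": {
            "properties": {
                "title_vector": {"type": "dense_vector", "dims": 512}
            }
        }
    }
}

print(check_dims(mapping, [0.0] * 512))  # True  - USE v2 output size
print(check_dims(mapping, [0.0] * 300))  # False - a 300-d vector would be rejected
```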


Enter a query.

 

I typed "amazon game"..

Result

What exactly is supposed to be similar here..
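The score itself gives a hint. The query script is cosineSimilarity(...) + 1.0, so every hit lands between 0.0 (embeddings pointing in opposite directions) and 2.0 (identical direction); a score near 2 means the query embedding and the title embedding are almost parallel. A toy illustration with made-up 3-d vectors (real USE vectors are 512-d):

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy vectors; the + 1.0 mirrors the script_score source.
q = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]
far = [-1.0, 0.0, 0.0]
print(round(cosine(q, close) + 1.0, 3))  # 1.994 -> "similar"
print(cosine(q, far) + 1.0)              # 0.0   -> opposite
```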

 

 

Let's dig in.

 

Running it at home,

an error.

This is the error that comes up.

[Fix]

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

 

Full code: main.py


 

import json
import time

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Use tensorflow 1 behavior to match the Universal Sentence Encoder
# examples (https://tfhub.dev/google/universal-sentence-encoder/2).
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
##### INDEXING #####

def index_data():
    print("Creating the 'posts' index.")
    client.indices.delete(index=INDEX_NAME, ignore=[404])

    with open(INDEX_FILE) as index_file:
        source = index_file.read().strip()
        client.indices.create(index=INDEX_NAME, body=source)

    docs = []
    count = 0

    with open(DATA_FILE) as data_file:
        for line in data_file:
            line = line.strip()

            doc = json.loads(line)
            if doc["type"] != "question":
                continue

            docs.append(doc)
            count += 1

            if count % BATCH_SIZE == 0:
                index_batch(docs)
                docs = []
                print("Indexed {} documents.".format(count))

        if docs:
            index_batch(docs)
            print("Indexed {} documents.".format(count))

    client.indices.refresh(index=INDEX_NAME)
    print("Done indexing.")

def index_batch(docs):
    titles = [doc["title"] for doc in docs]
    title_vectors = embed_text(titles)

    requests = []
    for i, doc in enumerate(docs):
        request = doc
        request["_op_type"] = "index"
        request["_index"] = INDEX_NAME
        request["title_vector"] = title_vectors[i]
        requests.append(request)
    bulk(client, requests)

##### SEARCHING #####

def run_query_loop():
    while True:
        try:
            handle_query()
        except KeyboardInterrupt:
            return

def handle_query():
    query = input("Enter query: ")

    embedding_start = time.time()
    query_vector = embed_text([query])[0]
    embedding_time = time.time() - embedding_start

    script_query = {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, doc['title_vector']) + 1.0",
                "params": {"query_vector": query_vector}
            }
        }
    }

    search_start = time.time()
    response = client.search(
        index=INDEX_NAME,
        body={
            "size": SEARCH_SIZE,
            "query": script_query,
            "_source": {"includes": ["title", "body"]}
        }
    )
    search_time = time.time() - search_start

    print()
    print("{} total hits.".format(response["hits"]["total"]["value"]))
    print("embedding time: {:.2f} ms".format(embedding_time * 1000))
    print("search time: {:.2f} ms".format(search_time * 1000))
    for hit in response["hits"]["hits"]:
        print("id: {}, score: {}".format(hit["_id"], hit["_score"]))
        print(hit["_source"])
        print()

##### EMBEDDING #####

def embed_text(text):
    vectors = session.run(embeddings, feed_dict={text_ph: text})
    return [vector.tolist() for vector in vectors]

##### MAIN SCRIPT #####

if __name__ == '__main__':
    INDEX_NAME = "posts"
    INDEX_FILE = "data/posts/index.json"

    DATA_FILE = "data/posts/posts.json"
    BATCH_SIZE = 1000

    SEARCH_SIZE = 5

    GPU_LIMIT = 0.5

    print("Downloading pre-trained embeddings from tensorflow hub...")
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
    text_ph = tf.placeholder(tf.string)
    embeddings = embed(text_ph)
    print("Done.")

    print("Creating tensorflow session...")
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = GPU_LIMIT
    session = tf.Session(config=config)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    print("Done.")

    client = Elasticsearch(http_auth=('elastic', 'dlengus'))

    index_data()
    run_query_loop()

    print("Closing tensorflow session...")
    session.close()
    print("Done.")

 

To do

1. Test this with Korean text.

2. Build an environment for comparing score values.
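For item 2, a small helper that flattens a response into (id, score) pairs would make two runs easy to diff side by side. The response shape is the standard Elasticsearch one that main.py already prints; the helper and the sample data below are my own sketch:

```python
def score_table(response):
    """Extract (doc id, score) pairs from an Elasticsearch response dict."""
    return [(h["_id"], h["_score"]) for h in response["hits"]["hits"]]

# Fabricated responses standing in for two query runs (e.g. English vs. Korean).
resp_en = {"hits": {"hits": [{"_id": "1", "_score": 1.91}, {"_id": "2", "_score": 1.40}]}}
resp_ko = {"hits": {"hits": [{"_id": "1", "_score": 1.72}, {"_id": "3", "_score": 1.35}]}}

for (id_a, s_a), (id_b, s_b) in zip(score_table(resp_en), score_table(resp_ko)):
    print(id_a, s_a, "|", id_b, s_b)
```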
