개발잡부
[tensorflow] text-embeddings
https://github.com/900gle/text-embeddings
: A simple example of how to use Elasticsearch for similarity search by combining a sentence embedding model with the dense_vector field type
Development environment:
- macOS
- Python 3.7.9
- TensorFlow 1.14
Check the pip version
pip3 -V
Install dependencies
pip3 install -r requirements.txt
requirements.txt asks for es 7.0.2, but an ES environment is already set up here, so let's just start that container.
Run
python3 src/main.py
posts index
{
  "version": 5,
  "mapping_version": 1,
  "settings_version": 1,
  "aliases_version": 1,
  "routing_num_shards": 1024,
  "state": "open",
  "settings": {
    "index": {
      "routing": {
        "allocation": {
          "include": {
            "_tier_preference": "data_content"
          }
        }
      },
      "number_of_shards": "2",
      "provided_name": "posts",
      "creation_date": "1640152158634",
      "number_of_replicas": "1",
      "uuid": "QJn79J_ZQ_CBHjXTUGrshw",
      "version": {
        "created": "7120199"
      }
    }
  },
  "mappings": {
    "_doc": {
      "dynamic": "true",
      "properties": {
        "answerId": {
          "type": "keyword"
        },
        "questionId": {
          "type": "keyword"
        },
        "acceptedAnswerId": {
          "type": "keyword"
        },
        "title_vector": {
          "dims": 512,
          "type": "dense_vector"
        },
        "body": {
          "type": "text"
        },
        "creationDate": {
          "type": "date"
        },
        "title": {
          "type": "text"
        },
        "type": {
          "type": "keyword"
        },
        "user": {
          "type": "keyword"
        },
        "tags": {
          "type": "keyword"
        }
      }
    }
  },
  "aliases": [],
  "primary_terms": {
    "0": 1,
    "1": 1
  },
  "in_sync_allocations": {
    "0": [
      "kXcl3xrXRueJjhv4cIWBBw"
    ],
    "1": [
      "bATrkQrNSJGAkNKUcGgGIw"
    ]
  },
  "rollover_info": {},
  "system": false,
  "timestamp_range": {
    "unknown": true
  }
}
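Most of the metadata above is generated by the cluster. The part that matters for similarity search is the title_vector field. Below is a minimal sketch of a create-index body built in Python, assuming only the field names visible in the mapping above. `build_posts_mapping` is a hypothetical helper, not something from the repository, and the real data/posts/index.json may look different.

```python
import json


def build_posts_mapping(dims=512):
    """Return a create-index body with a dense_vector field of `dims` dimensions.

    Hypothetical helper for illustration only. Field names are copied from
    the mapping shown above; the shard/replica counts mirror its settings.
    """
    return {
        "settings": {"number_of_shards": 2, "number_of_replicas": 1},
        "mappings": {
            "dynamic": "true",
            "properties": {
                "title": {"type": "text"},
                "body": {"type": "text"},
                "type": {"type": "keyword"},
                # The vector length must match what the embedding model emits.
                "title_vector": {"type": "dense_vector", "dims": dims},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_posts_mapping(), indent=2))
```

The `dims` value has to equal the length of every vector that gets indexed, so it is worth keeping it as a single parameter next to the model choice.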
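title_vector is mapped as a 512-dimensional dense_vector.
How was the dimension chosen? (The Universal Sentence Encoder outputs 512-dimensional vectors, so the mapping presumably just follows the model.)
I entered "amazon game" as the query,
but it is not obvious what is supposed to be similar in the results.
Let's dig in.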
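To make sense of the scores: the query in main.py wraps a match_all in script_score and uses the source `cosineSimilarity(params.query_vector, doc['title_vector']) + 1.0`. Every document therefore gets a score and the top few are returned, whether or not they are actually close. The `+ 1.0` is there because Elasticsearch does not allow negative scores. Here is a rough plain-Python sketch of that arithmetic; the function names are made up for illustration and this is not how Elasticsearch implements it internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_score(query_vector, title_vector):
    """Mirror of "cosineSimilarity(...) + 1.0": maps [-1, 1] onto [0, 2]."""
    return cosine_similarity(query_vector, title_vector) + 1.0


if __name__ == "__main__":
    query = [1.0, 0.0]
    print(script_score(query, [1.0, 0.0]))   # same direction
    print(script_score(query, [0.0, 1.0]))   # orthogonal
    print(script_score(query, [-1.0, 0.0]))  # opposite direction
```

Read this way, a score near 2.0 means the title points in almost the same direction as the query, and a score near 1.0 means the two are essentially unrelated.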
Running it at home
produces this error.
[Fix]
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
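The two lines above turn off HTTPS certificate verification for every urllib-based request in the process. If only the model download needs it, the override can be limited to that step. A minimal sketch using just the standard library; `unverified_https` is a hypothetical name, and it relies on the same private `ssl` attribute as the fix above.

```python
import contextlib
import ssl


@contextlib.contextmanager
def unverified_https():
    """Temporarily disable HTTPS certificate verification for urllib.

    Hypothetical helper: swaps in the unverified context factory and
    restores the previous one on exit, even if an exception is raised.
    """
    original = ssl._create_default_https_context
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        yield
    finally:
        ssl._create_default_https_context = original


# Usage sketch, assuming the download happens while the block is active:
#
#     with unverified_https():
#         embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
```

Outside the `with` block, certificate verification behaves as it did before.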
Full code: main.py
import json
import time

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Use tensorflow 1 behavior to match the Universal Sentence Encoder
# examples (https://tfhub.dev/google/universal-sentence-encoder/2).
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

##### INDEXING #####

def index_data():
    print("Creating the 'posts' index.")
    client.indices.delete(index=INDEX_NAME, ignore=[404])

    with open(INDEX_FILE) as index_file:
        source = index_file.read().strip()
        client.indices.create(index=INDEX_NAME, body=source)

    docs = []
    count = 0

    with open(DATA_FILE) as data_file:
        for line in data_file:
            line = line.strip()

            doc = json.loads(line)
            if doc["type"] != "question":
                continue

            docs.append(doc)
            count += 1

            if count % BATCH_SIZE == 0:
                index_batch(docs)
                docs = []
                print("Indexed {} documents.".format(count))

        if docs:
            index_batch(docs)
            print("Indexed {} documents.".format(count))

    client.indices.refresh(index=INDEX_NAME)
    print("Done indexing.")

def index_batch(docs):
    titles = [doc["title"] for doc in docs]
    title_vectors = embed_text(titles)

    requests = []
    for i, doc in enumerate(docs):
        request = doc
        request["_op_type"] = "index"
        request["_index"] = INDEX_NAME
        request["title_vector"] = title_vectors[i]
        requests.append(request)
    bulk(client, requests)

##### SEARCHING #####

def run_query_loop():
    while True:
        try:
            handle_query()
        except KeyboardInterrupt:
            return

def handle_query():
    query = input("Enter query: ")

    embedding_start = time.time()
    query_vector = embed_text([query])[0]
    embedding_time = time.time() - embedding_start

    script_query = {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, doc['title_vector']) + 1.0",
                "params": {"query_vector": query_vector}
            }
        }
    }

    search_start = time.time()
    response = client.search(
        index=INDEX_NAME,
        body={
            "size": SEARCH_SIZE,
            "query": script_query,
            "_source": {"includes": ["title", "body"]}
        }
    )
    search_time = time.time() - search_start

    print()
    print("{} total hits.".format(response["hits"]["total"]["value"]))
    print("embedding time: {:.2f} ms".format(embedding_time * 1000))
    print("search time: {:.2f} ms".format(search_time * 1000))
    for hit in response["hits"]["hits"]:
        print("id: {}, score: {}".format(hit["_id"], hit["_score"]))
        print(hit["_source"])
        print()

##### EMBEDDING #####

def embed_text(text):
    vectors = session.run(embeddings, feed_dict={text_ph: text})
    return [vector.tolist() for vector in vectors]

##### MAIN SCRIPT #####

if __name__ == '__main__':
    INDEX_NAME = "posts"
    INDEX_FILE = "data/posts/index.json"

    DATA_FILE = "data/posts/posts.json"
    BATCH_SIZE = 1000

    SEARCH_SIZE = 5

    GPU_LIMIT = 0.5

    print("Downloading pre-trained embeddings from tensorflow hub...")
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
    text_ph = tf.placeholder(tf.string)
    embeddings = embed(text_ph)
    print("Done.")

    print("Creating tensorflow session...")
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = GPU_LIMIT
    session = tf.Session(config=config)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    print("Done.")

    client = Elasticsearch(http_auth=('elastic', 'dlengus'))

    index_data()
    run_query_loop()

    print("Closing tensorflow session...")
    session.close()
    print("Done.")
To do
1. Test with Korean text
2. Set up an environment where score values can be compared