[es] 로그에서 외래어 추출

Notice

Recent Posts

Recent Comments

Link

« 2025/05 »
일	월	화	수	목	금	토
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Tags more

Archives

Today

Total

관리 메뉴

개발잡부

[es] 로그에서 외래어 추출 본문

카테고리 없음

[es] 로그에서 외래어 추출

닉의네임 2025. 3. 26. 08:24

외래어 추출.. 을 하라고 한다..

애매하다..

로그에서 외래어라고 판단하는건 사람이 해야 하는데.. 추출을 하면 그게 또 근거를 제시해야 하는..

아무튼..

생각해낸 방법은

한국어 어문규범

여기에 등록된 67,184개의 외래어를 이용했다.

일단 이걸 색인으로 만들었다. 시간이 없어서 지피티 선생님을 활용해 인덱스 생성 쿼리를 급하게 만들고

색인쿼리도 급 제작

최근 검색어 최근 1주일(3/13 ~ 3/20)

최근검색어 (실패검색어로 변경) 에서 상위 30,000개 추출
추출된 검색어와 외래어를 비교하여 추출 (추출기준 하단 참고)
공백없이 한단어로 이루어진 단어 (5,322개)
추출된 외래어와 편집거리 1에서 비슷한 단어 검색로그에서 추출 (유입수 10 이상)
- 작업자의 판단으로 진행여부 결정 (약 250개 예상)

외래어 기준

한국어 어문 규범에 등록된 67,184개의 외래어와 비교하여 추출

konlpy mecab 형태소 분석을 통한 전처리 (일반명사와 고유명사로 분리해서 처리 )
1. 일반명사(NNG)
  1. 단어 사이에 공백 추가 (복합어를 multi_match 하기 위함)
  2. 한국어 어문 규범 인덱스에 조회 결과로 외래어 판단
    1. fuzziness : "AUTO" 의 multi_match
      1. 편집거리
        
        길이 1~2의 단어: 편집 거리 0 (완전히 동일해야 검색)
        
        길이 3~5의 단어: 편집 거리 최대 1 허용
        
        길이 6 이상의 단어: 편집 거리 최대 2 허용
      2. 분해된 텀의 70% 이상 일치하는 경우 검색결과로 인정
      3. 검색어 한글자 이상
      4. 한국어 어문 규범 조회결과 한글자 이상
    2. lanquage: 영어
    3. 인명 제외
  3. 검색결과가 한글자로 이루어진 단어인 경우 외래어 후보 처리
2. 고유명사(NNP)
  1. 외래어 후보 처리 (사람의 판단 필요)
3. 한글만으로 이루어진 키워드

결과

PUT /loanwords
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "refresh_interval": "30h"
  },
  "mappings": {
    "properties": {
      "id": { "type": "integer" },
      "type": { "type": "keyword" },
      "korean_name": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      },
      "original_name": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      },
      "country": { "type": "keyword" },
      "language": { "type": "keyword" },
      "alt_spelling_1": { "type": "text" },
      "alt_spelling_2": { "type": "text" },
      "alt_spelling_3": { "type": "text" },
      "wrong_spelling_1": { "type": "text" },
      "wrong_spelling_2": { "type": "text" },
      "wrong_spelling_3": { "type": "text" },
      "wrong_spelling_4": { "type": "text" },
      "wrong_spelling_5": { "type": "text" },
      "meaning": { "type": "text" },
      "source": { "type": "text" },
      "public": { "type": "keyword" }
    }
  }
}

# -*- coding: utf-8 -*-

import json
import time
import datetime as dt
import urllib3
import config
import pandas as pd
import csv
from elasticsearch import Elasticsearch, helpers

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def connected():
    try:
       info = client.info()
       version = info['version']['number']
       print(f"Elasticsearch 버전: {version}")
    except Exception as e:
       print(f"Elasticsearch에 연결할 수 없습니다: {e}")

def extrect():
    print("extrect start")

    with open(DIC_FILE, "r", encoding="utf-8-sig") as dic:
        csv_reader = csv.reader(dic)
        for row in csv_reader:
            row = [cell.strip() for cell in row]

            print(row)  # 각 행을 리스트로 출력

            document = {
                "id": int(row[0]),  # 번호 (ID)
                "type": row[1],  # 구분
                "korean_name": row[2],  # 한글 표기
                "original_name": row[3],  # 원어 표기
                "country": row[4],  # 국명
                "language": row[5],  # 언어명
                "alt_spelling_1": row[6],  # 이표기1
                "alt_spelling_2": row[7],  # 이표기2
                "alt_spelling_3": row[8],  # 이표기3
                "wrong_spelling_1": row[9],  # 오표기1
                "wrong_spelling_2": row[10],  # 오표기2
                "wrong_spelling_3": row[11],  # 오표기3
                "wrong_spelling_4": row[12],  # 오표기4
                "wrong_spelling_5": row[13],  # 오표기5
                "meaning": row[14],  # 의미
                "source": row[15],  # 출전
                "public": row[16]  # 공개 여부
            }
            action = {
                "_op_type": "index",  # "index"는 데이터를 색인하는 작업
                "_index": INDEX_NAME,  # 인덱스 이름
                "_source": document  # 실제 데이터
            }

            # bulk API를 사용하여 데이터 색인
            helpers.bulk(client, [action])
        client.indices.refresh(index=INDEX_NAME)
        print("데이터 색인 완료!")


##### MAIN SCRIPT #####
if __name__ == '__main__':
    client = Elasticsearch("https:///", ca_certs=False,
                   verify_certs=False)
    connected()
    START = f"{config.START}"
    END = f"{config.END}"
    SIZE = f"{config.SIZE}"

    STEP1_EXTRACT_RESULT_FILE = "./result/step1_extrect_result.txt"
    STEP2_EVENT_KEY_MATCH_RESULT_FILE = "./result/step2_evnet_key_match_result.txt"
    STEP3_ADD_COUNT_RESULT_FILE = "./result/step3_add_count_result.txt"

    KOREAN_WORD = "./query/korean_word.txt"

    MAIN_KEY = "./query/keyword.txt" #base 키워드

    DIC_FILE = "./query/foregin_dic.csv"

    INDEX_NAME = "loanwords"


    QUERY_FILE = "./query/query.json"
    TERM_QUERY_FILE = "./query/term_query.json"
    ADD_COUNT_KEY = "./query/add_count_keyword.txt"


    extrect()

    print("Done.")

치킨을 돌려보면

이렇게 나온다.

# -*- coding: utf-8 -*-

import json
import time
import datetime as dt
import urllib3
import config
from elasticsearch import Elasticsearch
from konlpy.tag import Okt, Mecab
import re

okt = Okt()
mecab = Mecab()

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def insert_spaces(keyword):
    okt = Okt()
    # 형태소 분석을 통해 단어 분리
    words = okt.morphs(keyword)
    return ' '.join(words)

def connected():
    try:
       info = client.info()
       version = info['version']['number']
       print(f"Elasticsearch 버전: {version}")
    except Exception as e:
       print(f"Elasticsearch에 연결할 수 없습니다: {e}")

def is_only_korean(text):
    return bool(re.match('^[가-힣\s]+$', text))
def has_two_or_more_words(text):
    words = text.strip().split()
    return len(words) >= 2

def extrect():
    print("extrect start")
    step1_extrect_array = set()
    with open(FAIL_QUERY_FILE) as query_file:
        query_source = query_file.read().strip()
        query_source = query_source.replace("${start}", START)
        query_source = query_source.replace("${end}", END)
        query = query_source.replace("${size}", SIZE)
        response = client.search(index=FAIL_INDEX_NAME, body=query)
        for val in response['aggregations']['KEYWORD']['buckets']:
            if not has_two_or_more_words(val['key']):
                loan = check_loanword(val['key'])
                if loan is None:
                    print("None: "+val['key']+","+str(val['doc_count']) )
                else:
                    if (
                        is_only_korean(val['key']) and
                        len(val['key']) > 1 and
                        len(loan['korean_name']) > 1 and
                        not has_two_or_more_words(loan['korean_name'])
                    ):

                        step1_extrect_array.add(val['key']+","+str(val['doc_count'])+","+check_loanword_list(val['key'])+","+loan['korean_name']+","+loan['tag'] +"\n")


    with open(STEP1_EXTRACT_RESULT_FILE, "w", encoding="utf-8") as outfile:
        outfile.writelines(step1_extrect_array)

    print(f"파일생성: "+STEP1_EXTRACT_RESULT_FILE)


def check_loanword_list(keyword):
    ret_array = set()
    with open(FUZZY_QUERY_FINAL_FILE) as query_file:
        query_source = query_file.read().strip()
        query_source = query_source.replace("${start}", START)
        query_source = query_source.replace("${end}", END)
        query_source = query_source.replace("${length}", str(len(keyword)))
        query = query_source.replace("${keyword}", keyword)
        response = client.search(index=INDEX_NAME, body=query)
        for val in response['aggregations']['KEYWORD']['buckets']:
            if (val['doc_count'] > 10):
                ret_array.add(val['key'])

    return (' | ').join(ret_array)



def check_loanword(keyword):
    with open(FUZZY_QUERY_FILE) as query_file:
        query_source = query_file.read().strip()
        result = analyze_with_mecab(keyword)
        if len(result["common_nouns"]) > 0 :
            print("common_nouns: " + str(result["common_nouns"]))
            common_nouns = result["common_nouns"]
            search_word = ' '.join(common_nouns)
            query = query_source.replace("${keyword}", search_word)
            response = client.search(index=LOAN_INDEX_NAME, body=query)
            for hit in response["hits"]["hits"]:
                if(hit["_source"]):
                    return { "korean_name" : hit["_source"]["korean_name"], "tag" : "NNG" }
            return None
        elif len(result["proper_nouns"]) > 0:
            return { "korean_name" : keyword, "tag" : "NNP" }
        else:
            return None


def analyze_with_mecab(text):
    mecab = Mecab()
    pos = mecab.pos(text)
    proper_nouns = []   # 고유명사 (NNP)
    common_nouns = []   # 일반명사 (NNG)
    others = []         # 그 외 품사
    for word, tag in pos:
        if tag == 'NNP':
            proper_nouns.append(word)
        elif tag == 'NNG':
            common_nouns.append(word)
        else:
            others.append((word, tag))
    return {
        "proper_nouns": proper_nouns,
        "common_nouns": common_nouns,
        "others": others
    }

def get_type():
    if len(sys.argv) < 1:
        print("사용법: python extrect.py full")
    else:
        print("첫 번째 인자:", sys.argv[1])
        return sys.argv[1]



##### MAIN SCRIPT #####
if __name__ == '__main__':
    client = Elasticsearch("https://", ca_certs=False,
                   verify_certs=False)
    connected()
    START = f"{config.START}"
    END = f"{config.END}"
    SIZE = f"{config.SIZE}"

    STEP1_EXTRACT_RESULT_FILE = "./result/step1_extrect_result.txt"
    STEP2_EXTRACT_RESULT_FILE = "./result/step2_extrect_result.txt"

    INDEX_NAME = "home-search-query-log"
    FAIL_INDEX_NAME = "home-failure-querylog-2025.03"
    LOAN_INDEX_NAME = "loanwords"
    QUERY_FILE = "./query/query.json"
    FAIL_QUERY_FILE = "./query/fail_query.json"
    FUZZY_QUERY_FILE = "./query/fuzzy_query.json"
    FUZZY_QUERY_FINAL_FILE = "./query/fuzzy_query_final.json"

    extrect()

#     keyword = "즉석카레"
#     print(check_loanword_list(keyword))
#     keyword = "아이스크림"
#     print(check_loanword(keyword))
    print("Done.")

Comments

개발잡부

[es] 로그에서 외래어 추출 본문

[es] 로그에서 외래어 추출

티스토리툴바