Text Preprocessing¶

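Text preprocessing prepares text in advance, according to its intended use, so that natural language processing techniques can be applied to it.

It consists of tokenization, cleaning, and normalization.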

Tokenization¶

Word Tokenization¶

In [1]:

from nltk.tokenize import word_tokenize
from nltk.tokenize import WordPunctTokenizer # splits punctuation into separate tokens
from tensorflow.keras.preprocessing.text import text_to_word_sequence

In [2]:

# Handling apostrophes (')
print("word_token_1 : ", word_tokenize("They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't holds with such nonsence"))

word_token_1 :  ['They', 'were', 'the', 'last', 'people', 'you', "'d", 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'did', "n't", 'holds', 'with', 'such', 'nonsence']

In [3]:

# WordPunctTokenizer splits punctuation into separate tokens
print("word_token_2 : ", WordPunctTokenizer().tokenize("They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't holds with such nonsence"))

word_token_2 :  ['They', 'were', 'the', 'last', 'people', 'you', "'", 'd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'didn', "'", 't', 'holds', 'with', 'such', 'nonsence']
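The splitting shown above can be reproduced with the standard library. NLTK documents WordPunctTokenizer as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`; the sketch below applies that pattern directly with `re`. It is an approximation for illustration, not a replacement for the NLTK class, and the variable names are made up for this example.

```python
import re

sentence = ("They were the last people you'd expect to be involved in anything "
            "strange or mysterious, because they just didn't holds with such nonsence")

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
word_punct_pattern = re.compile(r"\w+|[^\w\s]+")

regex_tokens = word_punct_pattern.findall(sentence)
print(regex_tokens)
```

Because the apostrophe is not a word character, "you'd" becomes three tokens (you, ', d), which is the same behaviour seen in word_token_2.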

In [4]:

# Keras tokenization: lowercases all letters and removes punctuation such as periods, commas, and exclamation marks (apostrophes are kept)
print("word_token_3 : ", text_to_word_sequence("They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't holds with such nonsence"))

word_token_3 :  ['they', 'were', 'the', 'last', 'people', "you'd", 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', 'because', 'they', 'just', "didn't", 'holds', 'with', 'such', 'nonsence']
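As a rough picture of what text_to_word_sequence does with its default arguments, the sketch below lowercases the text, replaces every character in a filter string with a space, and splits on whitespace. The helper name `simple_word_sequence` is hypothetical, and the filter string is assumed to match the Keras default. The TensorFlow documentation marks text_to_word_sequence as deprecated, so a small stand-in like this may also help where the Keras utility is unavailable.

```python
# Assumed to match the default `filters` argument of text_to_word_sequence.
# The apostrophe is not in this set, so "you'd" and "didn't" stay whole.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS):
    """Hypothetical stand-in: lowercase, blank out filtered characters, split."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

sentence = ("They were the last people you'd expect to be involved in anything "
            "strange or mysterious, because they just didn't holds with such nonsence")

keras_like_tokens = simple_word_sequence(sentence)
print(keras_like_tokens)
```

On this sentence the result matches word_token_3: the comma disappears and the contractions are kept as single tokens.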

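Tokenization caveats

1) Do not remove punctuation or special characters unconditionally.

A period (.), for example, helps identify sentence boundaries.
When extracting words, you may therefore choose to keep the period.

Special characters can also carry meaning, for example in prices or dates, and those cases need to be considered.

2) Contractions and spaces inside a word
Some words form a single unit even though they contain a space.
New York is one example.

3) Standard tokenization (Penn Treebank tokenization)

Rule 1. A hyphenated word is kept as a single token.
Rule 2. A word with a clitic (a contracted form) attached by an apostrophe, such as Don't, is split.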
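Caveats 1 and 2 above can be made concrete with a small sketch that uses only Python's `re` module. The sample sentence and the hand-written pattern are assumptions for illustration; a real pipeline would need far more cases than a price, one date format, and one place name.

```python
import re

text = "A ticket to New York cost $45.55 on 01/02/06."

# Naive approach: delete every punctuation character, then split on whitespace.
naive_tokens = re.sub(r"[^\w\s]", "", text).split()
print(naive_tokens)
# ['A', 'ticket', 'to', 'New', 'York', 'cost', '4555', 'on', '010206']

# Hypothetical pattern that protects meaningful units before falling back
# to plain words and single punctuation marks. Order matters: the specific
# alternatives must come before the generic \w+.
careful_pattern = re.compile(
    r"New York"            # a known multi-word name
    r"|\$\d+(?:\.\d+)?"    # a dollar price such as $45.55
    r"|\d{2}/\d{2}/\d{2}"  # a date such as 01/02/06
    r"|\w+"                # any other word
    r"|[^\w\s]"            # any other single punctuation mark
)
careful_tokens = careful_pattern.findall(text)
print(careful_tokens)
# ['A', 'ticket', 'to', 'New York', 'cost', '$45.55', 'on', '01/02/06', '.']
```

The naive version turns the price into 4555 and the date into 010206, and it separates New and York; the second version keeps each of them as one token and still keeps the final period as a sentence-boundary marker.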

In [5]:

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

text = "What do I care about how he looks? I am good-looking enough for both of us, I think! All these scars show that my husband is brave!"

print('Tokenization by Penn Treebank : ', tokenizer.tokenize(text))

Tokenization by Penn Treebank :  ['What', 'do', 'I', 'care', 'about', 'how', 'he', 'looks', '?', 'I', 'am', 'good-looking', 'enough', 'for', 'both', 'of', 'us', ',', 'I', 'think', '!', 'All', 'these', 'scars', 'show', 'that', 'my', 'husband', 'is', 'brave', '!']
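The sample sentence above exercises Rule 1 (good-looking stays whole) but contains no contraction. A minimal sketch for Rule 2, using the same TreebankWordTokenizer on a made-up sentence:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Made-up sentence containing both a contraction and a hyphenated word
contraction_text = "I don't think home-made ice cream is hard to make"
contraction_tokens = tokenizer.tokenize(contraction_text)

print('Tokenization by Penn Treebank : ', contraction_tokens)
```

The contraction is split into do and n't, while home-made is left as one token.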

Sentence Tokenization¶

In [6]:

from nltk.tokenize import sent_tokenize # English sentence tokenization

text = """ What are you talking about? 'One woman'? That's like saying there's only one flavor of ice cream for
you. Lemme tell you something, Ross. There's lots of flavors out there. There's Rocky Road, and Cookie
Dough, and Bing! Cherry Vanilla. You could get 'em with Jimmies, or nuts, or whipped cream! This is the
best thing that ever happened to you! You got married, you were, like, what, eight? Welcome back to the
world! Grab a spoon! """

print('sentence_tokenization_1 : ', sent_tokenize(text))

sentence_tokenization_1 :  [' What are you talking about?', "'One woman'?", "That's like saying there's only one flavor of ice cream for\nyou.", 'Lemme tell you something, Ross.', "There's lots of flavors out there.", "There's Rocky Road, and Cookie\nDough, and Bing!", 'Cherry Vanilla.', "You could get 'em with Jimmies, or nuts, or whipped cream!", 'This is the\nbest thing that ever happened to you!', 'You got married, you were, like, what, eight?', 'Welcome back to the\nworld!', 'Grab a spoon!']
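A period helps mark a sentence boundary, but it is not a reliable marker on its own. The sketch below is a deliberately naive rule-based splitter written with `re` (the sample text and the rule are assumptions for illustration); it shows the kind of mistake that motivates using a dedicated sentence tokenizer such as sent_tokenize.

```python
import re

sample = "I am studying for a Ph.D. degree. It is hard! Will it ever end?"

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)
print(naive_sentences)
# ['I am studying for a Ph.D.', 'degree.', 'It is hard!', 'Will it ever end?']
```

The abbreviation Ph.D. ends with a period followed by a space, so the naive rule cuts the first sentence in two.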

In [10]:

# pip install kss

In [9]:

import kss # Korean sentence tokenization

In [11]:

# Sample Korean text containing three sentences
text = '딥 러닝 자연어 처리가 신비롭게 느껴집니다. 하지만 문제는 영어보다 한국어로 할 때 너무 어렵습니다. 설마 포기하진 않겠죠?'
print('Korean sentence tokenization :', kss.split_sentences(text))

[Kss]: Because there's no supported C++ morpheme analyzer, Kss will take pecab as a backend. :D
For your information, Kss also supports mecab backend.
We recommend you to install mecab or konlpy.tag.Mecab for faster execution of Kss.
Please refer to following web sites for details:
- mecab: https://cleancode-ws.tistory.com/97
- konlpy.tag.Mecab: https://uwgdqo.tistory.com/363

Korean sentence tokenization : ['딥 러닝 자연어 처리가 신비롭게 느껴집니다.', '하지만 문제는 영어보다 한국어로 할 때 너무 어렵습니다.', '설마 포기하진 않겠죠?']

Part-of-speech tagging¶

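Part-of-speech (POS) tagging distinguishes words that are written the same way but differ in meaning.

For example, the Korean word '안' can mean "inside" in a spatial sense, or it can express negation, as in '안 먹다' ("not eat").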

In [12]:

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

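Some of the Penn Treebank tags used by NLTK:

PRP: personal pronoun
VBP: verb (non-3rd person singular present)
RB: adverb
VBG: verb, gerund or present participle
IN: preposition
NNP: proper noun
NNS: plural noun
CC: conjunction
DT: determiner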

In [14]:

text = "The only way to overcome a slump is to prepare harder."
tokenized_sentence = word_tokenize(text)

print('word_tokenization :', tokenized_sentence)
print('tagging :', pos_tag(tokenized_sentence))

word_tokenization : ['The', 'only', 'way', 'to', 'overcome', 'a', 'slump', 'is', 'to', 'prepare', 'harder', '.']
tagging : [('The', 'DT'), ('only', 'JJ'), ('way', 'NN'), ('to', 'TO'), ('overcome', 'VB'), ('a', 'DT'), ('slump', 'NN'), ('is', 'VBZ'), ('to', 'TO'), ('prepare', 'VB'), ('harder', 'NN'), ('.', '.')]
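The tags in this output can be made easier to read with a small lookup table. The sketch below copies the tagged pairs from the output above and attaches a description to each tag; the glossary is hand-written for the tags that occur in this sentence, and the function name `describe` is hypothetical.

```python
# Tagged pairs copied from the output above
tagged = [('The', 'DT'), ('only', 'JJ'), ('way', 'NN'), ('to', 'TO'),
          ('overcome', 'VB'), ('a', 'DT'), ('slump', 'NN'), ('is', 'VBZ'),
          ('to', 'TO'), ('prepare', 'VB'), ('harder', 'NN'), ('.', '.')]

# Hand-written glossary of the Penn Treebank tags that appear in this sentence
TAG_DESCRIPTIONS = {
    'DT': 'determiner',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'TO': 'to',
    'VB': 'verb, base form',
    'VBZ': 'verb, 3rd person singular present',
    '.': 'sentence-final punctuation',
}

def describe(tagged_tokens):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, TAG_DESCRIPTIONS.get(tag, 'unknown tag'))
            for word, tag in tagged_tokens]

for word, tag, description in describe(tagged):
    print(f"{word:10} {tag:4} {description}")
```

The glossary only explains what a tag means; it does not check whether the tagger chose well. Here 'harder' is labelled NN, which illustrates that a statistical tagger can mislabel a word.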

Korean natural language processing (KoNLPy)

KoNLPy provides the morphological analyzers Okt (Open Korea Text), Mecab, Komoran, Hannanum, and Kkma.

In [15]:

from konlpy.tag import Okt
from konlpy.tag import Kkma

Tokenizing with the morphological analyzers:

morphs : extracts morphemes
pos : part-of-speech tagging
nouns : extracts nouns

In [16]:

okt = Okt()
kkma = Kkma()

In [17]:

print('Okt morphemes :', okt.morphs("열심히 코딩한 당신, 좋은 결과가 있을 겁니다."))
print('Okt POS tagging :', okt.pos("열심히 코딩한 당신, 좋은 결과가 있을 겁니다."))
print('Okt nouns :', okt.nouns("열심히 코딩한 당신, 좋은 결과가 있을 겁니다."))

Okt morphemes : ['열심히', '코딩', '한', '당신', ',', '좋은', '결과', '가', '있을', '겁니다', '.']
Okt POS tagging : [('열심히', 'Adverb'), ('코딩', 'Noun'), ('한', 'Josa'), ('당신', 'Noun'), (',', 'Punctuation'), ('좋은', 'Adjective'), ('결과', 'Noun'), ('가', 'Josa'), ('있을', 'Adjective'), ('겁니다', 'Verb'), ('.', 'Punctuation')]
Okt nouns : ['코딩', '당신', '결과']
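The tagged pairs returned by pos can be filtered directly when you need something other than nouns. The sketch below works on the Okt tagging output copied from above, so it runs without KoNLPy installed; the helper name `filter_by_tag` is hypothetical, and this is not a claim about how okt.nouns is implemented.

```python
# Okt POS tagging output copied from the cell above
okt_pos = [('열심히', 'Adverb'), ('코딩', 'Noun'), ('한', 'Josa'), ('당신', 'Noun'),
           (',', 'Punctuation'), ('좋은', 'Adjective'), ('결과', 'Noun'), ('가', 'Josa'),
           ('있을', 'Adjective'), ('겁니다', 'Verb'), ('.', 'Punctuation')]

def filter_by_tag(tagged_tokens, wanted_tags):
    """Keep only the words whose tag is in wanted_tags, preserving order."""
    return [word for word, tag in tagged_tokens if tag in wanted_tags]

nouns_only = filter_by_tag(okt_pos, {'Noun'})
content_words = filter_by_tag(okt_pos, {'Noun', 'Verb', 'Adjective'})

print(nouns_only)     # ['코딩', '당신', '결과']
print(content_words)  # ['코딩', '당신', '좋은', '결과', '있을', '겁니다']
```

On this sentence the noun filter gives the same three words as the nouns call, and widening the tag set drops particles (Josa) and punctuation while keeping the content words.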

In [18]:

print('Kkma morphemes :', kkma.morphs("열심히 코딩한 당신, 좋은 결과가 있을 겁니다."))
print('Kkma POS tagging :', kkma.pos("열심히 코딩한 당신, 좋은 결과가 있을 겁니다."))
print('Kkma nouns :', kkma.nouns("열심히 코딩한 당신, 좋은 결과가 있을 겁니다."))

Kkma morphemes : ['열심히', '코딩', '하', 'ㄴ', '당신', ',', '좋', '은', '결과', '가', '있', '을', '것', '이', 'ㅂ니다', '.']
Kkma POS tagging : [('열심히', 'MAG'), ('코딩', 'NNG'), ('하', 'XSV'), ('ㄴ', 'ETD'), ('당신', 'NP'), (',', 'SP'), ('좋', 'VA'), ('은', 'ETD'), ('결과', 'NNG'), ('가', 'JKS'), ('있', 'VV'), ('을', 'ETD'), ('것', 'NNB'), ('이', 'VCP'), ('ㅂ니다', 'EFN'), ('.', 'SF')]
Kkma nouns : ['코딩', '당신', '결과']


Kang's Note

[Text Preprocessing] 1. Tokenization
