- Code: Select all
from collections import Counter

class Vocab:
    def __init__(self, texts, vocab_size=10000):
        self.vocab_size = vocab_size
        # Reserve fixed indices for the special tokens
        self.word2idx = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
        self.idx2word = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]
        # Count whitespace-separated words across all texts and keep the most frequent
        counter = Counter(" ".join(texts).split())
        for word, _ in counter.most_common(vocab_size - len(self.word2idx)):
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)

    def encode(self, text, max_len):
        # Map words to indices (unknown words -> <UNK>), then truncate/pad to max_len
        tokens = text.split()
        ids = [self.word2idx.get(token, self.word2idx["<UNK>"]) for token in tokens[:max_len]]
        return ids + [self.word2idx["<PAD>"]] * (max_len - len(ids))

    def decode(self, indices):
        # Skip <PAD>, <SOS> and <EOS> when turning indices back into text
        return " ".join(self.idx2word[idx] for idx in indices if idx not in {0, 1, 2})
SpaCy: A fast and efficient tokenizer that is part of the SpaCy library, which is widely used for various NLP tasks.
Word2Vec: Although primarily known for word embeddings, Word2Vec also includes a tokenizer that processes text into tokens.
WordPiece: Used in models like BERT, it breaks down words into subword units to handle out-of-vocabulary words effectively.
Byte-Pair Encoding (BPE): Used in models like GPT, it merges the most frequent pairs of bytes to create subword units (a toy merge-step sketch follows this list).
SentencePiece: A tokenizer that can handle languages without spaces, like Chinese and Japanese, and is used in models like T5 and ALBERT.
Unigram: A tokenizer that starts with a large vocabulary and iteratively removes the least likely tokens, used by tokenizers such as SentencePiece.
N-gram: A tokenizer that breaks text into contiguous sequences of n items (words, characters, etc.), commonly used in statistical language models and text analysis (see the short example after this list).
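To make the BPE idea concrete, here is a minimal sketch of a single merge step on a toy word-frequency table. This is a simplified illustration of the core idea (count adjacent symbol pairs, merge the most frequent one), not the exact implementation used by any particular library:

Code: Select all
from collections import Counter

# Toy corpus as word frequencies; each word is a tuple of symbols ending in "</w>"
corpus = {("l", "o", "w", "</w>"): 5,
          ("l", "o", "w", "e", "r", "</w>"): 2,
          ("n", "e", "w", "e", "s", "t", "</w>"): 6}

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# One BPE training step: find the most frequent pair and merge it everywhere
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)
print(pair)    # most frequent adjacent pair in the toy corpus
print(corpus)  # corpus with that pair merged into a single subword symbol

Repeating this step builds up a vocabulary of progressively larger subword units, which is how BPE handles words it has never seen as whole tokens.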
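Likewise, word-level n-grams take only a couple of lines of plain Python (shown here for illustration only):

Code: Select all
def ngrams(text, n=2):
    # Slide a window of size n over the whitespace-tokenized words
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the cat sat on the mat", n=2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]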