- Code: Select all
from collections import Counter

class Vocab:
    def __init__(self, texts, vocab_size=10000):
        self.vocab_size = vocab_size
        # Reserve fixed indices for the special tokens
        self.word2idx = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
        self.idx2word = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]
        # Count whitespace-separated words across all texts and keep the most frequent
        counter = Counter(" ".join(texts).split())
        for word, _ in counter.most_common(vocab_size - len(self.word2idx)):
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)

    def encode(self, text, max_len):
        # Map words to indices (unknown words -> <UNK>), then truncate/pad to max_len
        tokens = text.split()
        ids = [self.word2idx.get(token, self.word2idx["<UNK>"]) for token in tokens[:max_len]]
        return ids + [self.word2idx["<PAD>"]] * (max_len - len(ids))

    def decode(self, indices):
        # Skip <PAD>, <SOS> and <EOS> when turning indices back into text
        return " ".join(self.idx2word[idx] for idx in indices if idx not in {0, 1, 2})
SpaCy: A fast and efficient tokenizer that is part of the SpaCy library, which is widely used for various NLP tasks.
Word2Vec: Although primarily known for word embeddings, Word2Vec also includes a tokenizer that processes text into tokens.
WordPiece: Used in models like BERT, it breaks down words into subword units to handle out-of-vocabulary words effectively.
Byte-Pair Encoding (BPE): Used in models like GPT, it merges the most frequent pairs of bytes to create subword units (a toy merge-step sketch follows this list).
SentencePiece: A tokenizer that can handle languages without spaces, like Chinese and Japanese, and is used in models like T5 and ALBERT.
Unigram: A tokenizer that starts with a large vocabulary and iteratively removes the least likely tokens, used by tokenizers such as SentencePiece.
N-gram: A tokenizer that breaks text into contiguous sequences of n items (words, characters, etc.), commonly used in statistical language models and text analysis (see the short example after this list).
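To make the BPE idea concrete, here is a minimal sketch of a single merge step on a toy word-frequency table. This is a simplified illustration of the core idea (count adjacent symbol pairs, merge the most frequent one), not the exact implementation used by any particular library:

Code: Select all
from collections import Counter

# Toy corpus as word frequencies; each word is a tuple of symbols ending in "</w>"
corpus = {("l", "o", "w", "</w>"): 5,
          ("l", "o", "w", "e", "r", "</w>"): 2,
          ("n", "e", "w", "e", "s", "t", "</w>"): 6}

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# One BPE training step: find the most frequent pair and merge it everywhere
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)
print(pair)    # most frequent adjacent pair in the toy corpus
print(corpus)  # corpus with that pair merged into a single subword symbol

Repeating this step builds up a vocabulary of progressively larger subword units, which is how BPE handles words it has never seen as whole tokens.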
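Likewise, word-level n-grams take only a couple of lines of plain Python (shown here for illustration only):

Code: Select all
def ngrams(text, n=2):
    # Slide a window of size n over the whitespace-tokenized words
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the cat sat on the mat", n=2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]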