GroupIWord2Vec.WordEmbedding
— Type
WordEmbedding
A structure for storing and managing word embeddings, where each word is associated with a vector representation.
Fields
words::Vector{String}
: List of all words in the vocabulary
embeddings::Matrix{Float64}
: Matrix where each column is a word's vector representation
word_indices::Dict{String, Int}
: Dictionary mapping words to their positions in the vocabulary
Constructor
WordEmbedding(words::Vector{String}, matrix::Matrix{Float64})
Creates a WordEmbedding with the given vocabulary and corresponding vectors.
Throws
ArgumentError
: If the number of words doesn't match the number of vectors (matrix columns)
Example
# Create a simple word embedding with 2D vectors
words = ["cat", "dog", "house"]
vectors = [0.5 0.1 0.8;
           0.2 0.9 0.3]
embedding = WordEmbedding(words, vectors)
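The fields can then be accessed directly; a minimal sketch using the embedding above:
embedding.word_indices["dog"]   # 2, the column index of "dog"
embedding.embeddings[:, 2]      # [0.1, 0.9], the vector for "dog"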
GroupIWord2Vec.create_custom_model
— Method
create_custom_model(embedding_dim::Int, vocabulary_lenght::Int)::Chain
Creates a Flux model for CBOW.
Arguments
embedding_dim::Int
: The desired dimensionality of the embedding. 10-300 is recommended depending on the task complexity and available resources.
vocabulary_lenght::Int
: Number of words in the vocabulary
Returns
Chain
: A Flux chain with softmax output
Notes
- The returned chain can be called like my_model([2, 5, 18, 12]), which returns the prediction for the word with context [2, 5, 18, 12] as a softmax probability.
Example
my_model = create_custom_model(50, length(my_vocabulary))
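Following the note above, the chain can then be queried directly; a minimal sketch where the context indices [2, 5, 18, 12] are purely illustrative and must be valid vocabulary indices:
probs = my_model([2, 5, 18, 12])   # softmax probabilities over the whole vocabulary
predicted_index = argmax(probs)    # index of the most likely word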
GroupIWord2Vec.create_vocabulary
— Method
create_vocabulary(path::String)::Dict{String, Int}
Creates a vocabulary from a text file containing all occurring words.
Arguments
path::String
: Path to the text file as a string
Returns
Dict{String, Int}
: A dictionary mapping words to their corresponding indices
Example
my_vocabulary = create_vocabulary("data/mydataset.txt")
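The returned dictionary acts as a plain lookup table; a minimal sketch, assuming "house" occurs in the dataset:
index = my_vocabulary["house"]   # integer index of "house"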
GroupIWord2Vec.get_any2vec
— Method
get_any2vec(wv::WordEmbedding, word_or_vec::Union{String, Vector{Float64}}) -> Vector{Float64}
Converts a word into its corresponding vector representation or returns the vector unchanged if already provided
Arguments
wv::WordEmbedding
: The word embedding structure containing the vocabulary and embeddings
word_or_vec::Union{String, Vector{Float64}}
: A word to be converted into a vector, or a numerical vector to be validated
Returns
Vector{Float64}
: The vector representation of the word if the input is a String, or the validated vector
Throws
DimensionMismatch
: If the input vector does not match the embedding dimension.
ArgumentError
: If the input is neither a word nor a valid numeric vector.
Example
words = ["cat", "dog"]
vectors = [0.5 0.1;
           0.2 0.9]
wv = WordEmbedding(words, vectors)
get_any2vec(wv, "cat") # Returns [0.5, 0.2]
get_any2vec(wv, [0.5, 0.2]) # Returns [0.5, 0.2]
GroupIWord2Vec.get_similar_words
— Function
get_similar_words(wv::WordEmbedding, word_or_vec::Union{AbstractString, AbstractVector{<:Real}}, n::Int=10) -> Vector{String}
Finds the n most similar words to a given word or vector based on cosine similarity.
Arguments
wv
: The word embedding model.
word_or_vec
: The target word or embedding vector.
n
: Number of similar words to return (default: 10).
Throws
ArgumentError
: If n is not positive, the word is missing, or the vector has zero norm.
DimensionMismatch
: If the vector size is incorrect.
Returns
A list of the n most similar words, sorted by similarity.
Example
get_similar_words(model, "cat", 5) # ["dog", "kitten", "feline", "puppy", "pet"]
get_similar_words(model, get_word2vec(model, "ocean"), 3) # ["sea", "water", "wave"]
GroupIWord2Vec.get_vec2word
— Method
get_vec2word(wv::WordEmbedding, vec::Vector{Float64}) -> String
Retrieves the closest word in the embedding space to a given vector based on cosine similarity.
Arguments
wv::WordEmbedding
: The word embedding structure containing the vocabulary and embeddings
vec::Vector{Float64}
: A vector representation of a word
Returns
String
: The word from the vocabulary closest to the given vector
Throws
DimensionMismatch
: If the input vector's dimension does not match the word vector dimensions
Example
words = ["cat", "dog"]
vectors = [0.5 0.1;
           0.2 0.9]
embedding = WordEmbedding(words, vectors)
get_vec2word(embedding, [0.51, 0.19]) # Returns "cat"
GroupIWord2Vec.get_vector_operation
— Method
get_vector_operation(ww::WordEmbedding, inp1::Union{String, Vector{Float64}}, inp2::Union{String, Vector{Float64}}, operator::Symbol) -> Union{Vector{Float64}, Float64}
Performs a mathematical operation between two word embedding vectors
Arguments
ww::WordEmbedding
: The word embedding structure containing the vocabulary and embeddings
inp1::Union{String, Vector{Float64}}
: The first input, which can be a word (String) or a precomputed embedding vector
inp2::Union{String, Vector{Float64}}
: The second input, which can be a word (String) or a precomputed embedding vector
operator::Symbol
: The operation to perform. Must be one of :+, :-, :cosine, or :euclid
Throws
ArgumentError
: If the operator is invalid.
ArgumentError
: If cosine similarity is attempted on a zero vector.
DimensionMismatch
: If the input vectors do not have the same length.
Returns
Vector{Float64}
: If the operation is :+ or :-, returns the resulting word vector
Float64
: If the operation is :cosine or :euclid, returns a scalar value
Example
vec = get_vector_operation(model, "king", "man", :-)
similarity = get_vector_operation(model, "cat", "dog", :cosine)
distance = get_vector_operation(model, "car", "bicycle", :euclid)
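The vector results compose with the other lookup functions; a sketch of the classic king/man/woman analogy, assuming all three words are in the vocabulary:
diff = get_vector_operation(model, "king", "man", :-)       # king - man
analogy = get_vector_operation(model, diff, "woman", :+)    # (king - man) + woman
get_similar_words(model, analogy, 1)                        # plausibly ["queen"] on a typical corpus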
GroupIWord2Vec.get_word2vec
— Method
get_word2vec(wv::WordEmbedding, word::String) -> Vector{Float64}
Retrieves the embedding vector corresponding to a given word.
Arguments
wv::WordEmbedding
: The word embedding structure containing the vocabulary and embeddings
word::String
: The word to look up
Throws
ArgumentError
: If the word is not found in the embedding model
Returns
Vector{Float64}
: The embedding vector of the requested word
Example
vec = get_word2vec(model, "dog")
GroupIWord2Vec.get_word_analogy
— Function
get_word_analogy(wv::WordEmbedding, inp1::T, inp2::T, inp3::T, n::Int=5) where {T<:Union{AbstractString, AbstractVector{<:Real}}} -> Vector{String}
Finds the top n words that best complete the analogy inp1 - inp2 + inp3 = ?.
Arguments
wv::WordEmbedding
: The word embedding model.
inp1, inp2, inp3::T
: Words or vectors for analogy computation.
n::Int=5
: Number of closest matching words to return.
Returns
Vector{String}
: A list of the top n matching words.
Notes
- Input words are converted to vectors automatically.
- The computed analogy vector is normalized.
Example
get_word_analogy(model, "king", "man", "woman", 3)
# → ["queen", "princess", "duchess"]
GroupIWord2Vec.load_embeddings
— Method
load_embeddings(path::String; format::Symbol=:text, data_type::Type{Float64}=Float64, separator::Char=' ', skip_bytes::Int=1)
Loads word embeddings from a text or binary file.
Arguments
path::String
: Path to the embedding file
format::Symbol=:text
: File format (:text or :binary)
data_type::Type{Float64}=Float64
: Type of the word vectors
separator::Char=' '
: Word-vector separator in text files
skip_bytes::Int=1
: Bytes to skip after each word-vector pair in binary files
Throws
ArgumentError
: If format is not :text or :binary
Returns
WordEmbedding
: The loaded word embeddings structure
Example
embedding = load_embeddings("vectors.txt") # Load text format
embedding = load_embeddings("vectors.bin", format=:binary, data_type=Float64, skip_bytes=1) # Load binary format
GroupIWord2Vec.read_binary_format
— Method" readbinaryformat(filepath::AbstractString, ::Type{T}, normalize::Bool, separator::Char, skip_bytes::Int) where T<:Real -> WordEmbedding
Reads word embeddings from a binary file and converts them into a WordEmbedding object.
Arguments
filepath::AbstractString
: Path to the binary file containing word embeddings.
T<:Real
: Numeric type for storing embedding values (e.g., Float32, Float64).
normalize::Bool
: Whether to normalize vectors to unit length for comparison.
separator::Char
: Character separating words and vector data in the file.
skip_bytes::Int
: Number of bytes to skip after each word-vector pair (e.g., for handling separators).
Throws
SystemError
: If the file cannot be opened or read.
ArgumentError
: If the file format is incorrect or data is missing.
Returns
WordEmbedding
: A structure containing words and their corresponding embedding vectors.
Example
embeddings = read_binary_format("vectors.bin", Float32, true, ' ', 1)
GroupIWord2Vec.read_text_format
— Method
read_text_format(filepath::AbstractString, ::Type{T}, normalize::Bool, separator::Char) where T<:Real -> WordEmbedding
Reads word embeddings from a text file and converts them into a WordEmbedding object.
Arguments
filepath::AbstractString
: Path to the text file containing word embeddings.
T<:Real
: Numeric type for storing embedding values (e.g., Float32, Float64).
normalize::Bool
: Whether to normalize vectors to unit length for comparison.
separator::Char
: Character used to separate words and vector values in the file.
Throws
SystemError
: If the file cannot be opened or read.
ArgumentError
: If the file format is incorrect or data is missing.
Returns
WordEmbedding
: A structure containing words and their corresponding embedding vectors.
Example
embeddings = read_text_format("vectors.txt", Float32, true, ' ')
GroupIWord2Vec.reduce_to_2d
— Function
reduce_to_2d(data::Matrix{Float64}, number_of_pc::Int=2) -> Matrix{Float64}
Performs Principal Component Analysis (PCA) to reduce the dimensionality of a given dataset and returns the projected data.
Arguments
data::Matrix{Float64}
: The input data matrix where rows represent samples and columns represent features.
number_of_pc::Int=2
: The number of principal components to retain (default: 2).
Returns
Matrix{Float64}
: A matrix of shape (number_of_pc × N), where N is the number of samples, containing the projected data in the reduced-dimensional space.
Example
data = randn(100, 50) # 100 samples, 50 features
reduced_data = reduce_to_2d(data, 2)
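To project word vectors for plotting, the embedding matrix can be fed in the same way; a minimal sketch, assuming the column-per-word layout described for WordEmbedding (hence the transpose so that words become rows/samples):
projected = reduce_to_2d(permutedims(embedding.embeddings), 2)   # one 2D point per word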
GroupIWord2Vec.save_custom_model
— Method
save_custom_model(model::Chain, vocabulary::Dict{String, Int}, path::String)
Saves the model as a .txt file in the format expected by load_embeddings().
Arguments
model::Chain
: The Flux chain from create_custom_model()
vocabulary::Dict
: The vocabulary from create_vocabulary()
path::String
: Path to the file for saving.
Notes
- Make sure to choose a file with a .txt ending if you plan to use it with load_embeddings().
Example
save_custom_model(my_model, my_vocabulary, "data/saved_embedd.txt")
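The saved file can then be loaded back; a minimal round-trip sketch:
embedding = load_embeddings("data/saved_embedd.txt")   # reload the saved model as a WordEmbedding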
GroupIWord2Vec.sequence_text
— Method
sequence_text(path::String, vocabulary::Dict{String, Int})::Vector{Int64}
Transforms a text into a vector of indices matching the words in the vocabulary.
Arguments
path::String
: Path to the text file as a string
vocabulary::Dict{String, Int}
: Vocabulary used as a lookup table
Returns
Vector{Int64}
: A vector of integers containing the text in index form, e.g. [1, 5, 23, 99, 69, ...]
Example
sequence = sequence_text("data/mydataset.txt", my_vocabulary)
GroupIWord2Vec.show_relations
— Method
show_relations(words::String...; wv::WordEmbedding, save_path::String="word_relations.png") -> Plots.Plot
Generates a 2D PCA projection of the given word embeddings and visualizes their relationships in pairs: arg1 ==> arg2, arg3 ==> arg4, ... Note: use an even number of inputs!
Arguments
words::String...
: A list of words to visualize. The number of words must be a multiple of 2.
wv::WordEmbedding
: The word embedding structure containing the word vectors.
save_path::String="word_relations.png"
: The file path for the generated plot. The plot is not saved if this is empty or nothing.
Throws
ArgumentError
: If the number of words is not a multiple of 2.
ArgumentError
: If any of the provided words are not found in the embedding model.
Returns
Plots.Plot
: A scatter plot with arrows representing word relationships.
Example
p = show_relations("king", "queen", "man", "woman"; wv=model, save_path="relations.png")
GroupIWord2Vec.train_custom_model
— Method
train_custom_model(model::Chain, dataset::String, vocabulary::Dict, epochs::Int, window_size::Int; optimizer=Descent(), batchsize=10)::Chain
Trains a model on a given dataset.
Arguments
model::Chain
: The Flux chain from create_custom_model()
dataset::String
: Path to the dataset.
vocabulary::Dict
: The vocabulary from create_vocabulary()
epochs::Int
: Number of desired epochs.
window_size::Int
: Window size for the context window. The total window spans 2*window_size words because both preceding and following words are used as context.
optimizer=Descent()
: Optimizer from Flux used for training
batchsize=10
: Number of words trained per epoch. If batchsize = 0, all words in the dataset are used once per epoch.
Returns
Chain
: The updated Flux Chain after training.
Notes
- Number of words trained = epochs * batchsize.
Example
my_updated_model = train_custom_model(my_model, "data/my_dataset.txt", my_vocabulary, 10, 1)
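Putting the custom-model functions together, an end-to-end sketch of the CBOW workflow; paths and hyperparameters are illustrative:
my_vocabulary = create_vocabulary("data/my_dataset.txt")            # build the word index
my_model = create_custom_model(50, length(my_vocabulary))           # 50-dimensional embeddings
my_model = train_custom_model(my_model, "data/my_dataset.txt", my_vocabulary, 10, 1)
save_custom_model(my_model, my_vocabulary, "data/saved_embedd.txt") # persist for load_embeddings()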
GroupIWord2Vec.train_model
— Method
train_model(train::AbstractString, output::AbstractString;
            size::Int=100, window::Int=5, sample::AbstractFloat=1e-3,
            hs::Int=0, negative::Int=5, threads::Int=12, iter::Int=5,
            min_count::Int=5, alpha::AbstractFloat=0.025,
            debug::Int=2, binary::Int=0, cbow::Int=1,
            save_vocab=Nothing(), read_vocab=Nothing(),
            verbose::Bool=false) -> Nothing
Trains a Word2Vec model using the specified parameters.
CAUTION!
This function can only be used on Linux or macOS operating systems! macOS is only supported on Intel processors; Apple Silicon chips (M1, M2) are not supported!
Arguments
train::AbstractString
: Path to the input text file used for training.
output::AbstractString
: Path to save the trained word vectors.
size::Int
: Dimensionality of the word vectors (default: 100).
window::Int
: Maximum skip length between words (default: 5).
sample::AbstractFloat
: Threshold for word occurrence downsampling (default: 1e-3).
hs::Int
: Use hierarchical softmax (1 = enabled, 0 = disabled, default: 0).
negative::Int
: Number of negative samples (0 = disabled, common values: 5-10, default: 5).
threads::Int
: Number of threads for training (default: 12).
iter::Int
: Number of training iterations (default: 5).
min_count::Int
: Minimum occurrences for a word to be included (default: 5).
alpha::AbstractFloat
: Initial learning rate (default: 0.025).
debug::Int
: Debugging verbosity level (default: 2).
binary::Int
: Save the vectors in binary format (1 = enabled, 0 = disabled, default: 0).
cbow::Int
: Use continuous bag-of-words model (1 = CBOW, 0 = Skip-gram, default: 1).
save_vocab
: Path to save the vocabulary (default: Nothing()).
read_vocab
: Path to read an existing vocabulary (default: Nothing()).
verbose::Bool
: Print training progress (default: false).
Throws
SystemError
: If the training process encounters an issue with file paths.
ArgumentError
: If input parameters are invalid.
Returns
Nothing
: The function trains the model and saves the output to a file.
Example
train_model("data.txt", "model.vec"; size=200, window=10, iter=10)