GroupIWord2Vec.WordEmbedding — Type
WordEmbedding

A structure for storing and managing word embeddings, where each word is associated with a vector representation.

Fields

  • words::Vector{String}: List of all words in the vocabulary
  • embeddings::Matrix{Float64}: Matrix where each column is a word's vector representation
  • word_indices::Dict{String, Int}: Dictionary mapping words to their positions in the vocabulary

Constructor

WordEmbedding(words::Vector{String}, matrix::Matrix{Float64})

Creates a WordEmbedding with the given vocabulary and corresponding vectors.

Throws

  • ArgumentError: If the number of words doesn't match the number of vectors (matrix columns)

Example

# Create a simple word embedding with 2D vectors
words = ["cat", "dog", "house"]
vectors = [0.5 0.1 0.8;
           0.2 0.9 0.3]
embedding = WordEmbedding(words, vectors)
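
The documented fields can also be used directly; continuing the example above:

idx = embedding.word_indices["dog"]     # 2, the position of "dog" in the vocabulary
dog_vec = embedding.embeddings[:, idx]  # [0.1, 0.9], the column vector for "dog"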
source
GroupIWord2Vec.create_custom_model — Method
create_custom_model(embedding_dim::Int, vocabulary_length::Int)::Chain

Creates a Flux model for CBOW (continuous bag-of-words).

Arguments

  • embedding_dim::Int: The desired dimensionality of the embedding; 10 to 300 is recommended, depending on the complexity of the data and the available resources.
  • vocabulary_length::Int: Number of words in the vocabulary

Returns

  • Chain: A Flux chain with softmax output

Notes

  • The returned Chain can be used like this: my_model([2, 5, 18, 12]) returns the prediction for the word with the context [2, 5, 18, 12] as softmax probabilities.

Example

my_model = create_custom_model(50, length(my_vocabulary)) 
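
Following the Notes above, a query against the (still untrained) model might look like this sketch, where the context indices are arbitrary:

probs = my_model([2, 5, 18, 12])  # softmax probabilities over the vocabulary
predicted_index = argmax(probs)   # index of the most likely word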
source
GroupIWord2Vec.create_vocabulary — Method
create_vocabulary(path::String)::Dict{String, Int}

Creates a vocabulary from a text file containing all occurring words.

Arguments

  • path::String: Path to the text file as a string

Returns

  • Dict{String, Int}: A dictionary mapping the words to their corresponding indices

Example

my_vocabulary = create_vocabulary("data/mydataset.txt")
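
Since the returned vocabulary is an ordinary Dict, an index can be looked up directly (assuming "cat" occurs in the dataset):

my_vocabulary["cat"]  # index assigned to "cat"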
source
GroupIWord2Vec.get_any2vec — Method
get_any2vec(wv::WordEmbedding, word_or_vec::Union{String, Vector{Float64}}) -> Vector{Float64}

Converts a word into its corresponding vector representation or returns the vector unchanged if already provided

Arguments

  • wv::WordEmbedding: The word embedding structure containing the vocabulary and embeddings
  • word_or_vec::Union{String, Vector{Float64}}: A word to be converted into a vector, or a numerical vector to be validated

Returns

  • Vector{Float64}: The vector representation of the word if input is a String, or the validated vector

Throws

  • DimensionMismatch: If the input vector does not match the embedding dimension.
  • ArgumentError: If the input is neither a word nor a valid numeric vector.

Example

words = ["cat", "dog"]
vectors = [0.5 0.1;
           0.2 0.9]
wv = WordEmbedding(words, vectors)

get_any2vec(wv, "cat")  # Returns [0.5, 0.2]
get_any2vec(wv, [0.5, 0.2])  # Returns [0.5, 0.2]
source
GroupIWord2Vec.get_similar_words — Function
get_similar_words(wv::WordEmbedding, word_or_vec::Union{AbstractString, AbstractVector{<:Real}}, n::Int=10) -> Vector{String}

Finds the n most similar words to a given word or vector based on cosine similarity.

Arguments

  • wv: The word embedding model.
  • word_or_vec: The target word or embedding vector.
  • n: Number of similar words to return (default: 10).

Throws

  • ArgumentError: If n is not positive, the word is missing, or the vector has zero norm.
  • DimensionMismatch: If the vector size is incorrect.

Returns

A list of n most similar words, sorted by similarity.

Example

get_similar_words(model, "cat", 5)  # ["dog", "kitten", "feline", "puppy", "pet"]
get_similar_words(model, get_word2vec(model, "ocean"), 3)  # ["sea", "water", "wave"]
source
GroupIWord2Vec.get_vec2word — Method
get_vec2word(wv::WordEmbedding, vec::Vector{Float64}) -> String

Retrieves the closest word in the embedding space to a given vector based on cosine similarity.

Arguments

  • wv::WordEmbedding: The word embedding structure containing the vocabulary and embeddings
  • vec::Vector{Float64}: A vector representation of a word

Returns

  • String: The word from the vocabulary closest to the given vector

Throws

  • DimensionMismatch: If the input vector's dimension does not match the word vector dimensions

Example

words = ["cat", "dog"]
vectors = [0.5 0.1;
           0.2 0.9]
embedding = WordEmbedding(words, vectors)

get_vec2word(embedding, [0.51, 0.19])  # Returns "cat"
source
GroupIWord2Vec.get_vector_operation — Method
get_vector_operation(ww::WordEmbedding, inp1::Union{String, Vector{Float64}}, inp2::Union{String, Vector{Float64}}, operator::Symbol) -> Union{Vector{Float64}, Float64}

Performs a mathematical operation between two word embedding vectors

Arguments

  • ww::WordEmbedding: The word embedding structure containing the vocabulary and embeddings
  • inp1::Union{String, Vector{Float64}}: The first input, which can be a word (String) or a precomputed embedding vector
  • inp2::Union{String, Vector{Float64}}: The second input, which can be a word (String) or a precomputed embedding vector
  • operator::Symbol: The operation to perform. Must be one of :+, :-, :cosine, or :euclid

Throws

  • ArgumentError: If the operator is invalid.
  • ArgumentError: If cosine similarity is attempted on a zero vector
  • DimensionMismatch: If the input vectors do not have the same length

Returns

  • Vector{Float64}: If the operation is :+ or :-, returns the resulting word vector
  • Float64: If the operation is :cosine or :euclid, returns a scalar value

Example

vec = get_vector_operation(model, "king", "man", :-)
similarity = get_vector_operation(model, "cat", "dog", :cosine)
distance = get_vector_operation(model, "car", "bicycle", :euclid)
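
These operations can be chained with the other functions on this page; a sketch reproducing the classic king - man + woman analogy:

diff = get_vector_operation(model, "king", "man", :-)
analogy_vec = get_vector_operation(model, diff, "woman", :+)
get_similar_words(model, analogy_vec, 1)  # likely ["queen"] for a well-trained model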
source
GroupIWord2Vec.get_word2vec — Method
get_word2vec(wv::WordEmbedding, word::String) -> Vector{Float64}

Retrieves the embedding vector corresponding to a given word.

Arguments

  • wv::WordEmbedding: The word embedding structure containing the vocabulary and embeddings
  • word::String: The word to look up

Throws

  • ArgumentError: If the word is not found in the embedding model

Returns

  • Vector{Float64}: The embedding vector of the requested word

Example

vec = get_word2vec(model, "dog")
source
GroupIWord2Vec.get_word_analogy — Function
get_word_analogy(wv::WordEmbedding, inp1::T, inp2::T, inp3::T, n::Int=5) where {T<:Union{AbstractString, AbstractVector{<:Real}}} -> Vector{String}

Finds the top n words that best complete the analogy: inp1 - inp2 + inp3 = ?.

Arguments

  • wv::WordEmbedding: The word embedding model.
  • inp1, inp2, inp3::T: Words or vectors for analogy computation.
  • n::Int=5: Number of closest matching words to return.

Returns

  • Vector{String}: A list of the top n matching words.

Notes

  • Input words are converted to vectors automatically.
  • The computed analogy vector is normalized.

Example

get_word_analogy(model, "king", "man", "woman", 3) 
# → ["queen", "princess", "duchess"]
source
GroupIWord2Vec.load_embeddings — Method
load_embeddings(path::String; format::Symbol=:text, data_type::Type{Float64}=Float64, separator::Char=' ', skip_bytes::Int=1)

Loads word embeddings from a text or binary file.

Arguments

  • path::String: Path to the embedding file
  • format::Symbol=:text: File format, either :text or :binary
  • data_type::Type{Float64}=Float64: Numeric type of the word vectors
  • separator::Char=' ': Word-vector separator in text files
  • skip_bytes::Int=1: Bytes to skip after each word-vector pair in binary files

Throws

  • ArgumentError: If format is not :text or :binary

Returns

  • WordEmbedding: The loaded word embeddings structure

Example

embedding = load_embeddings("vectors.txt")  # Load text format
embedding = load_embeddings("vectors.bin", format=:binary, data_type=Float64, skip_bytes=1)  # Load binary format
source
GroupIWord2Vec.read_binary_format — Method
read_binary_format(filepath::AbstractString, ::Type{T}, normalize::Bool, separator::Char, skip_bytes::Int) where T<:Real -> WordEmbedding

Reads word embeddings from a binary file and converts them into a WordEmbedding object.

Arguments

  • filepath::AbstractString: Path to the binary file containing word embeddings.
  • T<:Real: Numeric type for storing embedding values (e.g., Float32, Float64).
  • normalize::Bool: Whether to normalize vectors to unit length for comparison.
  • separator::Char: Character separating words and vector data in the file.
  • skip_bytes::Int: Number of bytes to skip after each word-vector pair (e.g., for handling separators).

Throws

  • SystemError: If the file cannot be opened or read.
  • ArgumentError: If the file format is incorrect or data is missing.

Returns

  • WordEmbedding: A structure containing words and their corresponding embedding vectors.

Example

embeddings = read_binary_format("vectors.bin", Float32, true, ' ', 1)
source
GroupIWord2Vec.read_text_format — Method
read_text_format(filepath::AbstractString, ::Type{T}, normalize::Bool, 
                 separator::Char) where T<:Real -> WordEmbedding

Reads word embeddings from a text file and converts them into a WordEmbedding object.

Arguments

  • filepath::AbstractString: Path to the text file containing word embeddings.
  • T<:Real: Numeric type for storing embedding values (e.g., Float32, Float64).
  • normalize::Bool: Whether to normalize vectors to unit length for comparison.
  • separator::Char: Character used to separate words and vector values in the file.

Throws

  • SystemError: If the file cannot be opened or read.
  • ArgumentError: If the file format is incorrect or data is missing.

Returns

  • WordEmbedding: A structure containing words and their corresponding embedding vectors.

Example

embeddings = read_text_format("vectors.txt", Float32, true, ' ')
source
GroupIWord2Vec.reduce_to_2d — Function
reduce_to_2d(data::Matrix{Float64}, number_of_pc::Int=2) -> Matrix{Float64}

Performs Principal Component Analysis (PCA) to reduce the dimensionality of a given dataset and returns the projected data.

Arguments

  • data::Matrix{Float64}: The input data matrix where rows represent samples and columns represent features.
  • number_of_pc::Int=2: The number of principal components to retain (default: 2).

Returns

  • Matrix{Float64}: A matrix of shape (number_of_pc × N), where N is the number of samples, containing the projected data in the reduced dimensional space.

Example

data = randn(100, 50)                 # 100 samples, 50 features
reduced_data = reduce_to_2d(data, 2)  # 2×100 matrix of projected samples
source
GroupIWord2Vec.save_custom_model — Method
save_custom_model(model::Chain, vocabulary::Dict{String, Int}, path::String)

Saves the model as a text file in the format expected by load_embeddings().

Arguments

  • model::Chain: The Flux chain from create_custom_model.
  • vocabulary::Dict: The vocabulary from create_vocabulary().
  • path::String: Path to the file for saving.

Notes

  • Make sure to choose a file with a .txt ending if you plan to use it with load_embeddings().

Example

save_custom_model(my_model, my_vocabulary, "data/saved_embedd.txt")
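
Because the file uses the text format of load_embeddings(), the saved model can be reloaded directly:

embedding = load_embeddings("data/saved_embedd.txt")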
source
GroupIWord2Vec.sequence_text — Method
sequence_text(path::String, vocabulary::Dict{String, Int})::Vector{Int64}

Transforms a text into a vector of indices matching the words in the vocabulary.

Arguments

  • path::String: Path to the text file as a string
  • vocabulary::Dict{String, Int}: Vocabulary used as a look-up table

Returns

  • Vector{Int64}: A vector of integers containing the text in index form, e.g. [1, 5, 23, 99, 69, ...]

Example

sequence = sequence_text("data/mydataset.txt", my_vocabulary)
source
GroupIWord2Vec.show_relations — Method
show_relations(words::String...; wv::WordEmbedding, save_path::String="word_relations.png") -> Plots.Plot

Generates a 2D PCA projection of the given word embeddings and visualizes their relationships as pairwise arrows: arg1 ==> arg2, arg3 ==> arg4, and so on. Note: use an even number of inputs!

Arguments

  • words::String...: A list of words to visualize. The number of words must be a multiple of 2.
  • wv::WordEmbedding: The word embedding structure containing the word vectors.
  • save_path::String="word_relations.png": The file path where the generated plot is saved. The plot is not saved if the path is empty or nothing.

Throws

  • ArgumentError: If the number of words is not a multiple of 2.
  • ArgumentError: If any of the provided words are not found in the embedding model.

Returns

  • Plots.Plot: A scatter plot with arrows representing word relationships.

Example

p = show_relations("king", "queen", "man", "woman"; wv=model, save_path="relations.png")
source
GroupIWord2Vec.train_custom_model — Method
train_custom_model(model::Chain, dataset::String, vocabulary::Dict, epochs::Int, window_size::Int; optimizer=Descent(), batchsize=10)::Chain

Trains a model on a given dataset.

Arguments

  • model::Chain: The Flux chain from create_custom_model.
  • dataset::String: Path to the dataset.
  • vocabulary::Dict: The vocabulary from create_vocabulary()
  • epochs::Int: Number of desired epochs.
  • window_size::Int: Size of the context window. The total window spans 2 * window_size words because both preceding and following words are used as context.
  • optimizer=Descent(): Optimizer from Flux used for training
  • batchsize=10: Number of words trained per epoch. If batchsize = 0, all words in the dataset are used once per epoch.

Returns

  • Chain: The updated Flux Chain after training.

Notes

  • Number of words trained = epochs * batchsize.

Example

my_updated_model = train_custom_model(my_model, "data/my_dataset.txt", my_vocabulary, 10, 1)
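
A minimal end-to-end sketch combining the functions documented on this page (paths and hyperparameters are illustrative):

my_vocabulary = create_vocabulary("data/my_dataset.txt")
my_model = create_custom_model(50, length(my_vocabulary))
my_updated_model = train_custom_model(my_model, "data/my_dataset.txt", my_vocabulary, 10, 1; batchsize=100)
save_custom_model(my_updated_model, my_vocabulary, "data/saved_embedd.txt")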
source
GroupIWord2Vec.train_model — Method
train_model(train::AbstractString, output::AbstractString; 
            size::Int=100, window::Int=5, sample::AbstractFloat=1e-3,
            hs::Int=0, negative::Int=5, threads::Int=12, iter::Int=5, 
            min_count::Int=5, alpha::AbstractFloat=0.025,
            debug::Int=2, binary::Int=0, cbow::Int=1, 
            save_vocab=Nothing(), read_vocab=Nothing(),
            verbose::Bool=false) -> Nothing

Trains a Word2Vec model using the specified parameters.

CAUTION!

This function can only be used on Linux or macOS operating systems! macOS is only supported on Intel processors; Apple silicon (M1, M2) is not supported!

Arguments

  • train::AbstractString: Path to the input text file used for training.
  • output::AbstractString: Path to save the trained word vectors.
  • size::Int: Dimensionality of the word vectors (default: 100).
  • window::Int: Maximum skip length between words (default: 5).
  • sample::AbstractFloat: Threshold for word occurrence downsampling (default: 1e-3).
  • hs::Int: Use hierarchical softmax (1 = enabled, 0 = disabled, default: 0).
  • negative::Int: Number of negative samples (0 = disabled, common values: 5-10, default: 5).
  • threads::Int: Number of threads for training (default: 12).
  • iter::Int: Number of training iterations (default: 5).
  • min_count::Int: Minimum occurrences for a word to be included (default: 5).
  • alpha::AbstractFloat: Initial learning rate (default: 0.025).
  • debug::Int: Debugging verbosity level (default: 2).
  • binary::Int: Save the vectors in binary format (1 = enabled, 0 = disabled, default: 0).
  • cbow::Int: Use continuous bag-of-words model (1 = CBOW, 0 = Skip-gram, default: 1).
  • save_vocab: Path to save the vocabulary (default: Nothing()).
  • read_vocab: Path to read an existing vocabulary (default: Nothing()).
  • verbose::Bool: Print training progress (default: false).

Throws

  • SystemError: If the training process encounters an issue with file paths.
  • ArgumentError: If input parameters are invalid.

Returns

  • Nothing: The function trains the model and saves the output to a file.

Example

train_model("data.txt", "model.vec"; size=200, window=10, iter=10)
source