GroupIWord2Vec.WordEmbedding
— Type
WordEmbedding
A structure for storing and managing word embeddings, where each word is associated with a vector representation.
Fields
words::Vector{String}
: List of all words in the vocabulary
embeddings::Matrix{Float64}
: Matrix where each column is a word's vector representation
word_indices::Dict{String, Int}
: Dictionary mapping words to their positions in the vocabulary
Constructor
WordEmbedding(words::Vector{String}, matrix::Matrix{Float64})
Creates a WordEmbedding with the given vocabulary and corresponding vectors.
Throws
ArgumentError
: If the number of words doesn't match the number of vectors (matrix columns)
Example
# Create a simple word embedding with 2D vectors
words = ["cat", "dog", "house"]
vectors = [0.5 0.1 0.8;
           0.2 0.9 0.3]
embedding = WordEmbedding(words, vectors)
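The fields can then be accessed directly; a minimal sketch using the embedding above:
embedding.word_indices["dog"]   # 2, the column index of "dog"
embedding.embeddings[:, 2]      # [0.1, 0.9], the vector for "dog"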
GroupIWord2Vec.create_custom_model
— Method
create_custom_model(embedding_dim::Int, vocabulary_lenght::Int)::Chain
Creates a Flux model for CBOW.
Arguments
embedding_dim::Int
: The desired dimensionality of the embedding. 10-300 is recommended depending on the task complexity and available resources.
vocabulary_lenght::Int
: Number of words in the vocabulary
Returns
Chain
: A Flux chain with softmax output
Notes
- The returned chain can be called like my_model([2, 5, 18, 12]), which returns the prediction for the word with context [2, 5, 18, 12] as a softmax probability.
Example
my_model = create_custom_model(50, length(my_vocabulary))
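Following the note above, the chain can then be queried directly; a minimal sketch where the context indices [2, 5, 18, 12] are purely illustrative and must be valid vocabulary indices:
probs = my_model([2, 5, 18, 12])   # softmax probabilities over the whole vocabulary
predicted_index = argmax(probs)    # index of the most likely word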
GroupIWord2Vec.create_vocabulary
— Method
create_vocabulary(path::String)::Dict{String, Int}
Creates a vocabulary from a text file containing all occurring words.
Arguments
path::String
: Path to the text file as a string
Returns
Dict{String, Int}
: A dictionary mapping words to their corresponding indices
Example
my_vocabulary = create_vocabulary("data/mydataset.txt")
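The returned dictionary acts as a plain lookup table; a minimal sketch, assuming "house" occurs in the dataset:
index = my_vocabulary["house"]   # integer index of "house"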
GroupIWord2Vec.get_any2vec
— Method
get_any2vec(wv::WordEmbedding, word_or_vec::Union{String, Vector{Float64}}) -> Vector{Float64}
Converts a word into its corresponding vector representation or returns the vector unchanged if already provided
Arguments
wv::WordEmbedding
: The word embedding structure containing the vocabulary and embeddings
word_or_vec::Union{String, Vector{Float64}}
: A word to be converted into a vector, or a numerical vector to be validated
Returns
Vector{Float64}
: The vector representation of the word if the input is a String, or the validated vector
Throws
DimensionMismatch
: If the input vector does not match the embedding dimension.
ArgumentError
: If the input is neither a word nor a valid numeric vector.
Example
words = ["cat", "dog"]
vectors = [0.5 0.1;
           0.2 0.9]
wv = WordEmbedding(words, vectors)
get_any2vec(wv, "cat") # Returns [0.5, 0.2]
get_any2vec(wv, [0.5, 0.2]) # Returns [0.5, 0.2]
GroupIWord2Vec.get_similar_words
— Function
get_similar_words(wv::WordEmbedding, word_or_vec::Union{AbstractString, AbstractVector{<:Real}}, n::Int=10) -> Vector{String}
Finds the n most similar words to a given word or vector based on cosine similarity.
Arguments
wv
: The word embedding model.
word_or_vec
: The target word or embedding vector.
n
: Number of similar words to return (default: 10).
Throws
ArgumentError
: If n is not positive, the word is missing, or the vector has zero norm.
DimensionMismatch
: If the vector size is incorrect.
Returns
A list of the n most similar words, sorted by similarity.
Example
get_similar_words(model, "cat", 5) # ["dog", "kitten", "feline", "puppy", "pet"]
get_similar_words(model, get_word2vec(model, "ocean"), 3) # ["sea", "water", "wave"]
GroupIWord2Vec.get_vec2word
— Method
get_vec2word(wv::WordEmbedding, vec::Vector{Float64}) -> String
Retrieves the closest word in the embedding space to a given vector based on cosine similarity.
Arguments
wv::WordEmbedding
: The word embedding structure containing the vocabulary and embeddings
vec::Vector{Float64}
: A vector representation of a word
Returns
String
: The word from the vocabulary closest to the given vector
Throws
DimensionMismatch
: If the input vector's dimension does not match the word vector dimensions
Example
words = ["cat", "dog"]
vectors = [0.5 0.1;
           0.2 0.9]
embedding = WordEmbedding(words, vectors)
get_vec2word(embedding, [0.51, 0.19]) # Returns "cat"
GroupIWord2Vec.get_vector_operation
— Method
get_vector_operation(ww::WordEmbedding, inp1::Union{String, Vector{Float64}}, inp2::Union{String, Vector{Float64}}, operator::Symbol) -> Union{Vector{Float64}, Float64}
Performs a mathematical operation between two word embedding vectors
Arguments
ww::WordEmbedding
: The word embedding structure containing the vocabulary and embeddings
inp1::Union{String, Vector{Float64}}
: The first input, which can be a word (String) or a precomputed embedding vector
inp2::Union{String, Vector{Float64}}
: The second input, which can be a word (String) or a precomputed embedding vector
operator::Symbol
: The operation to perform. Must be one of :+, :-, :cosine, or :euclid
Throws
ArgumentError
: If the operator is invalid.
ArgumentError
: If cosine similarity is attempted on a zero vector.
DimensionMismatch
: If the input vectors do not have the same length.
Returns
Vector{Float64}
: If the operation is :+ or :-, returns the resulting word vector
Float64
: If the operation is :cosine or :euclid, returns a scalar value
Example
vec = get_vector_operation(model, "king", "man", :-)
similarity = get_vector_operation(model, "cat", "dog", :cosine)
distance = get_vector_operation(model, "car", "bicycle", :euclid)
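The vector results compose with the other lookup functions; a sketch of the classic king/man/woman analogy, assuming all three words are in the vocabulary:
diff = get_vector_operation(model, "king", "man", :-)       # king - man
analogy = get_vector_operation(model, diff, "woman", :+)    # (king - man) + woman
get_similar_words(model, analogy, 1)                        # plausibly ["queen"] on a typical corpus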
GroupIWord2Vec.get_word2vec
— Method
get_word2vec(wv::WordEmbedding, word::String) -> Vector{Float64}
Retrieves the embedding vector corresponding to a given word.
Arguments
wv::WordEmbedding
: The word embedding structure containing the vocabulary and embeddings
word::String
: The word to look up
Throws
ArgumentError
: If the word is not found in the embedding model
Returns
Vector{Float64}
: The embedding vector of the requested word
Example
vec = get_word2vec(model, "dog")
GroupIWord2Vec.get_word_analogy
— Function
get_word_analogy(wv::WordEmbedding, inp1::T, inp2::T, inp3::T, n::Int=5) where {T<:Union{AbstractString, AbstractVector{<:Real}}} -> Vector{String}
Finds the top n words that best complete the analogy inp1 - inp2 + inp3 = ?.
Arguments
wv::WordEmbedding
: The word embedding model.
inp1, inp2, inp3::T
: Words or vectors for analogy computation.
n::Int=5
: Number of closest matching words to return.
Returns
Vector{String}
: A list of the top n matching words.
Notes
- Input words are converted to vectors automatically.
- The computed analogy vector is normalized.
Example
get_word_analogy(model, "king", "man", "woman", 3)
# → ["queen", "princess", "duchess"]
GroupIWord2Vec.load_embeddings
— Method
load_embeddings(path::String; format::Symbol=:text, data_type::Type{Float64}=Float64, separator::Char=' ', skip_bytes::Int=1)
Loads word embeddings from a text or binary file.
Arguments
path::String
: Path to the embedding file
format::Symbol=:text
: File format (:text or :binary)
data_type::Type{Float64}=Float64
: Type of the word vectors
separator::Char=' '
: Word-vector separator in text files
skip_bytes::Int=1
: Bytes to skip after each word-vector pair in binary files
Throws
ArgumentError
: If format is not :text or :binary
Returns
WordEmbedding
: The loaded word embeddings structure
Example
embedding = load_embeddings("vectors.txt") # Load text format
embedding = load_embeddings("vectors.bin", format=:binary, data_type=Float64, skip_bytes=1) # Load binary format
GroupIWord2Vec.read_binary_format
— Method" readbinaryformat(filepath::AbstractString, ::Type{T}, normalize::Bool, separator::Char, skip_bytes::Int) where T<:Real -> WordEmbedding
Reads word embeddings from a binary file and converts them into a WordEmbedding object.
Arguments
filepath::AbstractString
: Path to the binary file containing word embeddings.
T<:Real
: Numeric type for storing embedding values (e.g., Float32, Float64).
normalize::Bool
: Whether to normalize vectors to unit length for comparison.
separator::Char
: Character separating words and vector data in the file.
skip_bytes::Int
: Number of bytes to skip after each word-vector pair (e.g., for handling separators).
Throws
SystemError
: If the file cannot be opened or read.
ArgumentError
: If the file format is incorrect or data is missing.
Returns
WordEmbedding
: A structure containing words and their corresponding embedding vectors.
Example
embeddings = read_binary_format("vectors.bin", Float32, true, ' ', 1)
GroupIWord2Vec.read_text_format
— Method
read_text_format(filepath::AbstractString, ::Type{T}, normalize::Bool, separator::Char) where T<:Real -> WordEmbedding
Reads word embeddings from a text file and converts them into a WordEmbedding object.
Arguments
filepath::AbstractString
: Path to the text file containing word embeddings.
T<:Real
: Numeric type for storing embedding values (e.g., Float32, Float64).
normalize::Bool
: Whether to normalize vectors to unit length for comparison.
separator::Char
: Character used to separate words and vector values in the file.
Throws
SystemError
: If the file cannot be opened or read.
ArgumentError
: If the file format is incorrect or data is missing.
Returns
WordEmbedding
: A structure containing words and their corresponding embedding vectors.
Example
embeddings = read_text_format("vectors.txt", Float32, true, ' ')
GroupIWord2Vec.reduce_to_2d
— Function
reduce_to_2d(data::Matrix{Float64}, number_of_pc::Int=2) -> Matrix{Float64}
Performs Principal Component Analysis (PCA) to reduce the dimensionality of a given dataset and returns the projected data.
Arguments
data::Matrix{Float64}
: The input data matrix where rows represent samples and columns represent features.
number_of_pc::Int=2
: The number of principal components to retain (default: 2).
Returns
Matrix{Float64}
: A matrix of shape (number_of_pc × N), where N is the number of samples, containing the projected data in the reduced-dimensional space.
Example
data = randn(100, 50) # 100 samples, 50 features
reduced_data = reduce_to_2d(data, 2)
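To project word vectors for plotting, the embedding matrix can be fed in the same way; a minimal sketch, assuming the column-per-word layout described for WordEmbedding (hence the transpose so that words become rows/samples):
projected = reduce_to_2d(permutedims(embedding.embeddings), 2)   # one 2D point per word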
GroupIWord2Vec.save_custom_model
— Method
save_custom_model(model::Chain, vocabulary::Dict{String, Int}, path::String)
Saves the model as a .txt file in the format expected by load_embeddings().
Arguments
model::Chain
: The Flux chain from create_custom_model()
vocabulary::Dict
: The vocabulary from create_vocabulary()
path::String
: Path to the file for saving.
Notes
- Make sure to choose a file with a .txt ending if you plan to use it with load_embeddings().
Example
save_custom_model(my_model, my_vocabulary, "data/saved_embedd.txt")
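The saved file can then be loaded back; a minimal round-trip sketch:
embedding = load_embeddings("data/saved_embedd.txt")   # reload the saved model as a WordEmbedding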
GroupIWord2Vec.sequence_text
— Method
sequence_text(path::String, vocabulary::Dict{String, Int})::Vector{Int64}
Transforms a text into a vector of indices matching the words in the vocabulary.
Arguments
path::String
: Path to the text file as a string
vocabulary::Dict{String, Int}
: Vocabulary used as a lookup table
Returns
Vector{Int64}
: A vector of integers containing the text in index form, e.g. [1, 5, 23, 99, 69, ...]
Example
sequence = sequence_text("data/mydataset.txt", my_vocabulary)
GroupIWord2Vec.show_relations
— Method
show_relations(words::String...; wv::WordEmbedding, save_path::String="word_relations.png") -> Plots.Plot
Generates a 2D PCA projection of the given word embeddings and visualizes their relationships in pairs: arg1 ==> arg2, arg3 ==> arg4, ... Note: use an even number of inputs!
Arguments
words::String...
: A list of words to visualize. The number of words must be a multiple of 2.
wv::WordEmbedding
: The word embedding structure containing the word vectors.
save_path::String="word_relations.png"
: The file path for the generated plot. The plot is not saved if this is empty or nothing.
Throws
ArgumentError
: If the number of words is not a multiple of 2.
ArgumentError
: If any of the provided words are not found in the embedding model.
Returns
Plots.Plot
: A scatter plot with arrows representing word relationships.
Example
p = show_relations("king", "queen", "man", "woman"; wv=model, save_path="relations.png")
GroupIWord2Vec.train_custom_model
— Method
train_custom_model(model::Chain, dataset::String, vocabulary::Dict, epochs::Int, window_size::Int; optimizer=Descent(), batchsize=10)::Chain
Trains a model on a given dataset.
Arguments
model::Chain
: The Flux chain from create_custom_model()
dataset::String
: Path to the dataset.
vocabulary::Dict
: The vocabulary from create_vocabulary()
epochs::Int
: Number of desired epochs.
window_size::Int
: Window size for the context window. The total window spans 2*window_size words because both preceding and following words are used as context.
optimizer=Descent()
: Optimizer from Flux used for training
batchsize=10
: Number of words trained per epoch. If batchsize = 0, all words in the dataset are used once per epoch.
Returns
Chain
: The updated Flux Chain after training.
Notes
- Number of words trained = epochs * batchsize.
Example
my_updated_model = train_custom_model(my_model, "data/my_dataset.txt", my_vocabulary, 10, 1)
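Putting the custom-model functions together, an end-to-end sketch of the CBOW workflow; paths and hyperparameters are illustrative:
my_vocabulary = create_vocabulary("data/my_dataset.txt")            # build the word index
my_model = create_custom_model(50, length(my_vocabulary))           # 50-dimensional embeddings
my_model = train_custom_model(my_model, "data/my_dataset.txt", my_vocabulary, 10, 1)
save_custom_model(my_model, my_vocabulary, "data/saved_embedd.txt") # persist for load_embeddings()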
GroupIWord2Vec.train_model
— Method
train_model(train::AbstractString, output::AbstractString;
            size::Int=100, window::Int=5, sample::AbstractFloat=1e-3,
            hs::Int=0, negative::Int=5, threads::Int=12, iter::Int=5,
            min_count::Int=5, alpha::AbstractFloat=0.025,
            debug::Int=2, binary::Int=0, cbow::Int=1,
            save_vocab=Nothing(), read_vocab=Nothing(),
            verbose::Bool=false) -> Nothing
Trains a Word2Vec model using the specified parameters.
CAUTION!
This function can only be used on Linux or macOS operating systems! macOS is only supported on Intel processors; Apple Silicon chips (M1, M2) are not supported!
Arguments
train::AbstractString
: Path to the input text file used for training.
output::AbstractString
: Path to save the trained word vectors.
size::Int
: Dimensionality of the word vectors (default: 100).
window::Int
: Maximum skip length between words (default: 5).
sample::AbstractFloat
: Threshold for word occurrence downsampling (default: 1e-3).
hs::Int
: Use hierarchical softmax (1 = enabled, 0 = disabled, default: 0).
negative::Int
: Number of negative samples (0 = disabled, common values: 5-10, default: 5).
threads::Int
: Number of threads for training (default: 12).
iter::Int
: Number of training iterations (default: 5).
min_count::Int
: Minimum occurrences for a word to be included (default: 5).
alpha::AbstractFloat
: Initial learning rate (default: 0.025).
debug::Int
: Debugging verbosity level (default: 2).
binary::Int
: Save the vectors in binary format (1 = enabled, 0 = disabled, default: 0).
cbow::Int
: Use continuous bag-of-words model (1 = CBOW, 0 = Skip-gram, default: 1).
save_vocab
: Path to save the vocabulary (default: Nothing()).
read_vocab
: Path to read an existing vocabulary (default: Nothing()).
verbose::Bool
: Print training progress (default: false).
Throws
SystemError
: If the training process encounters an issue with file paths.
ArgumentError
: If input parameters are invalid.
Returns
Nothing
: The function trains the model and saves the output to a file.
Example
train_model("data.txt", "model.vec"; size=200, window=10, iter=10)