where $h = v_{w_t}$ is the target word embedding (a row of $W_1$).
Computing a softmax over the entire vocabulary costs $O(|V|)$ per training pair, so we use negative sampling instead: for each positive context word, we sample $K$ random "negative" words and optimize a binary classification objective (positive pair vs. sampled negatives).
Gradients
Let $u = w_2^\top h$ be the dot product score. The loss for one (target, context) pair is:
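As a rough illustration, the sketch below uses the standard negative-sampling loss, $-\log\sigma(u_{\text{pos}}) - \sum_{k}\log\sigma(-u_{\text{neg},k})$, together with the gradients that follow from $\partial L / \partial u = \sigma(u) - \text{label}$. The function name `sgns_pair_loss_and_grads` and its argument names are hypothetical and are not taken from `word2vec.py`; the loss form is the textbook one and is assumed to match the one used here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_loss_and_grads(h, w_pos, w_negs):
    """Loss and gradients for one (target, context) pair with K negatives.

    h      : (D,)   target embedding (a row of W1)
    w_pos  : (D,)   output vector of the true context word
    w_negs : (K, D) output vectors of the K sampled negative words
    (All names are illustrative assumptions.)
    """
    u_pos = w_pos @ h      # scalar score for the positive pair
    u_neg = w_negs @ h     # (K,) scores for the negatives

    # Binary classification loss: label 1 for the positive, 0 for negatives.
    loss = -np.log(sigmoid(u_pos)) - np.sum(np.log(sigmoid(-u_neg)))

    # dL/du = sigmoid(u) - label
    g_pos = sigmoid(u_pos) - 1.0   # label = 1
    g_neg = sigmoid(u_neg)         # label = 0

    grad_h = g_pos * w_pos + g_neg @ w_negs   # (D,)
    grad_w_pos = g_pos * h                    # (D,)
    grad_w_negs = np.outer(g_neg, h)          # (K, D)
    return loss, grad_h, grad_w_pos, grad_w_negs
```

Only the target row of $W_1$ and the $K + 1$ output vectors involved receive gradients, which is what makes each update cheap compared with a full softmax.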
The exponent $0.75$ flattens the distribution relative to raw frequency, giving rare words a higher chance of being drawn as negatives. This prevents the model from only ever contrasting against the most common words.
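The effect of the exponent can be seen in a small sketch. It assumes the noise distribution is the unigram count raised to the power $0.75$ and then normalized; `noise_distribution` and `sample_negatives` are hypothetical helper names, not functions from this repository.

```python
import numpy as np

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, normalized to sum to 1."""
    counts = np.asarray(counts, dtype=np.float64)
    weights = counts ** power
    return weights / weights.sum()

def sample_negatives(probs, k, rng):
    """Draw k negative word indices (with replacement) from the noise distribution."""
    return rng.choice(len(probs), size=k, p=probs)

# Toy vocabulary: one very common word, one medium, one rare.
counts = [1000, 100, 10]
raw = np.array(counts) / np.sum(counts)   # plain relative frequency
smoothed = noise_distribution(counts)     # flattened distribution

rng = np.random.default_rng(0)
negatives = sample_negatives(smoothed, k=5, rng=rng)
```

In this toy example the rare word's sampling probability rises relative to its raw frequency, while the most common word's falls. The sketch does not exclude the true context word from the negatives; whether the actual implementation does so is not stated here.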
where $t = 10^{-3}$ and $f(w)$ is the relative frequency of the word. High-frequency words like *the*, *and*, *of* are discarded most aggressively, which reduces noise and speeds up training.
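A minimal sketch of frequent-word subsampling follows. It assumes the discard probability $1 - \sqrt{t / f(w)}$ from the original word2vec paper, clipped to $[0, 1]$; the exact formula used in `word2vec.py` may differ, and the names `discard_prob` and `subsample` are hypothetical.

```python
import numpy as np

def discard_prob(freq, t=1e-3):
    """Assumed discard probability: 1 - sqrt(t / f(w)), clipped to [0, 1]."""
    freq = np.asarray(freq, dtype=np.float64)
    return np.clip(1.0 - np.sqrt(t / freq), 0.0, 1.0)

def subsample(token_ids, rel_freq, rng, t=1e-3):
    """Randomly drop tokens; rel_freq[i] is the relative frequency of word i."""
    token_ids = np.asarray(token_ids)
    p_drop = discard_prob(rel_freq[token_ids], t)
    keep = rng.random(len(token_ids)) >= p_drop
    return token_ids[keep]
```

Under this formula, words with $f(w) \le t$ are never dropped, and the drop probability grows toward 1 as the frequency increases.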
Project structure
word2vec-numpy/
├── word2vec.py        # Word2Vec class (model, training, evaluation)
├── train.py           # Corpus loading, training script, t-SNE visualization
├── requirements.txt
└── assets/
Run
pip install -r requirements.txt
python train.py
Visualization
Trained on Shakespeare's Hamlet (NLTK Gutenberg corpus)
About
Word2Vec Skip-Gram with Negative Sampling implemented in pure NumPy