---
language:
- en
tags:
- word2vec
- embeddings
- nlp
- sports
- outdoors
- amazon-reviews
metrics:
- semantic similarity
---

# Word2Vec Model for Amazon Sports & Outdoors Reviews

## Model Description

This is a Word2Vec model trained on Amazon product reviews from the Sports & Outdoors category. The model was trained using the Gensim library on 296,337 reviews to learn word embeddings that capture semantic relationships between words in the context of sports and outdoor product reviews.

- **Model type**: Word2Vec (CBOW architecture, the Gensim default; Skip-gram would require `sg=1`)
- **Training data**: Amazon Sports & Outdoors reviews (296,337 reviews)
- **Vocabulary size**: Determined by `min_count=2` (only words that appear at least twice in the corpus are kept)
- **Vector dimension**: 100 (Gensim default)
- **Window size**: 10 words

## Intended Uses & Limitations

### Intended Use
This model is designed for:
- Semantic similarity tasks for sports and outdoor-related vocabulary
- Product recommendation systems
- Review analysis and sentiment tasks
- Keyword expansion and related term discovery
- Educational and research purposes

### Limitations
- The model is specialized for the sports and outdoors domain
- Performance on vocabulary outside this domain may be limited
- Inherits any biases present in the Amazon review data
- May not perform well for very recent terminology not present in the training data

## How to Use

### Installation
```bash
pip install gensim pandas
```

### Loading the Model
```python
import gensim

# Load the model
model = gensim.models.Word2Vec.load("word2vec_model.model")
```

### Getting Word Similarities
```python
# Find words similar to "good"
similar_words = model.wv.most_similar("good", topn=5)
print(similar_words)

# Find words similar to "slow"
similar_words = model.wv.most_similar("slow", topn=5)
print(similar_words)
```

### Additional Operations
```python
# Get word vector
vector = model.wv['running']

# Calculate similarity between two words
similarity = model.wv.similarity('hiking', 'outdoors')

# Find the odd one out (use single-word tokens; multi-word phrases such as
# 'sleeping bag' are not in the vocabulary)
odd_one = model.wv.doesnt_match(['tent', 'backpack', 'basketball'])
```
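
For intuition, `model.wv.similarity` returns the cosine similarity between two word vectors. A minimal plain-Python sketch of that computation follows; the 3-dimensional vectors are made-up values for illustration, not entries from this model (whose vectors have 100 dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, not taken from the model)
hiking = [0.9, 0.1, 0.3]
outdoors = [0.8, 0.2, 0.4]
basketball = [-0.2, 0.9, 0.1]

# Vectors pointing in similar directions score close to 1
print(cosine_similarity(hiking, outdoors))
# Vectors pointing in different directions score near 0 or below
print(cosine_similarity(hiking, basketball))
```

Scores range from -1 to 1, so they are best read relative to each other rather than as absolute measures.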

## Training Details

### Training Data
The model was trained on the [Amazon Sports & Outdoors reviews dataset](https://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Sports_and_Outdoors_5.json.gz), which contains 296,337 reviews with 9 fields each. The review text was preprocessed using Gensim's `simple_preprocess` function.
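
A minimal sketch of how the raw file could be streamed into token lists. It assumes each line of the gzipped file is a JSON object with a `reviewText` field; that field name is an assumption about the dataset, not something stated above. `simple_tokenize` is a hypothetical standard-library stand-in that only approximates `simple_preprocess` (lowercasing, keeping alphabetic tokens of 2 to 15 characters); the actual training presumably called Gensim's function directly.

```python
import gzip
import json
import re

TOKEN_RE = re.compile(r"[a-z]+")

def simple_tokenize(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess: lowercase the text
    and keep alphabetic tokens whose length is within [min_len, max_len]."""
    return [t for t in TOKEN_RE.findall(text.lower())
            if min_len <= len(t) <= max_len]

def iter_review_tokens(path, text_field="reviewText"):
    """Yield one token list per review from a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            review = json.loads(line)
            # Reviews without the text field yield an empty token list
            yield simple_tokenize(review.get(text_field, ""))
```

Streaming the file one review at a time avoids loading all 296,337 reviews into memory before tokenizing.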

### Hyperparameters
- Window size: 10
- Minimum word count: 2
- Vector size: 100 (default)
- Training algorithm: CBOW (Gensim default, `sg=0`)
- Negative samples: 5 (default)
- Epochs: 5 (default)
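
To make the minimum word count and window size concrete, here is a plain-Python sketch of what the two settings control. It is an illustration, not Gensim's implementation: `build_vocab` and `context_pairs` are hypothetical helper names, the two sentences are made-up data, and Gensim additionally samples a smaller effective window at random for each target word.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(sentence, vocab, window=10):
    """Yield (target, context) pairs that fall within the window.
    Out-of-vocabulary words are dropped before the window is applied."""
    kept = [t for t in sentence if t in vocab]
    for i, target in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, kept[j]

# Made-up toy corpus for illustration
sentences = [
    ["great", "tent", "for", "camping"],
    ["great", "bike", "for", "commuting"],
]
vocab = build_vocab(sentences, min_count=2)
print(sorted(vocab))  # ['for', 'great']
print(list(context_pairs(sentences[0], vocab, window=10)))
# [('great', 'for'), ('for', 'great')]
```

With a window of 10, most words in a short review sentence end up in each other's context, which tends to favor topical similarity over narrow syntactic similarity.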

## Evaluation

The model can be evaluated by examining the semantic relationships it captures. For example:
- It should find "excellent", "great", and "nice" similar to "good"
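- Its nearest neighbors for "slow" should include "fast" and "quick"; these are antonyms, which Word2Vec tends to place close together because they occur in similar contexts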
- It should maintain sports-specific relationships (e.g., "football" related to "soccer")
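
The spot checks above can be sketched as a small helper. `neighbor_hits` and `run_checks` are hypothetical names introduced here for illustration; they accept any callable with the same shape as `model.wv.most_similar`, which returns `(word, score)` pairs. Whether the loaded model actually passes these checks is not verified here.

```python
def neighbor_hits(most_similar, word, expected, topn=10):
    """Return the expected words found among the topn neighbors of word."""
    neighbors = {w for w, _score in most_similar(word, topn=topn)}
    return [e for e in expected if e in neighbors]

# Spot checks taken from the list above
CHECKS = {
    "good": ["excellent", "great", "nice"],
    "slow": ["fast", "quick"],
    "football": ["soccer"],
}

def run_checks(most_similar, checks=CHECKS, topn=10):
    """Map each query word to the expected neighbors that were found."""
    return {
        word: neighbor_hits(most_similar, word, expected, topn)
        for word, expected in checks.items()
    }
```

With the model loaded as shown earlier, `run_checks(model.wv.most_similar)` would report which expected neighbors appear in each top-10 list. A query word missing from the vocabulary raises a `KeyError` in Gensim, so such words need to be filtered out or handled first.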

## Model Performance

Quantitative evaluation metrics, such as accuracy on analogy tasks, are not provided. Qualitatively, the model captures meaningful semantic relationships for vocabulary in the sports and outdoors domain.

## Ethical Considerations

- The model may reflect biases present in the original Amazon reviews
- Should not be used for automated decision making without human oversight
- Users should be aware that word embeddings can amplify societal biases

## Citation

If you use this model in your research, please cite one or both of the following papers for the original Amazon reviews dataset:

```
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
```

## License

The model is shared for research purposes. The original data follows Amazon's terms of use.