chungimungi committed on
Commit
f540bdf
·
verified ·
1 Parent(s): e9d8f54

Upload 10 files

README.md ADDED
@@ -0,0 +1,600 @@
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - sentence-similarity
5
+ - feature-extraction
6
+ - dense
7
+ - generated_from_trainer
8
+ - dataset_size:95253
9
+ - loss:MultipleNegativesRankingLoss
10
+ base_model: thenlper/gte-base
11
+ widget:
12
+ - source_sentence: Molecular phylogenetic resolution of the mega-diverse clade Apoditrysia
13
+ sentences:
14
+ - In a previous study of higher-level arthropod phylogeny, analyses of nucleotide
15
+ sequences from 62 protein-coding nuclear genes for 80 panarthropod species yielded
16
+ significantly higher bootstrap support for selected nodes than did amino acids.
17
+ This study investigates the cause of that discrepancy. The hypothesis is tested
18
+ that failure to distinguish the serine residues encoded by two disjunct clusters
19
+ of codons (TCN, AGY) in amino acid analyses leads to this discrepancy. In one
20
+ test, the two clusters of serine codons (Ser1, Ser2) are conceptually translated
21
+ as separate amino acids. Analysis of the resulting 21-amino-acid data matrix shows
22
+ striking increases in bootstrap support, in some cases matching that in nucleotide
23
+ analyses. In a second approach, nucleotide and 20-amino-acid data sets are artificially
24
+ altered through targeted deletions, modifications, and replacements, revealing
25
+ the pivotal contributions of distinct Ser1 and Ser2 codons. We confirm that previous
26
+ methods of coding nonsynonymous nucleotide change are robust and computationally
27
+ efficient by introducing two new degeneracy coding methods. We demonstrate for
28
+ degeneracy coding that neither compositional heterogeneity at the level of nucleotides
29
+ nor codon usage bias between Ser1 and Ser2 clusters of codons (or their separately
30
+ coded amino acids) is a major source of non-phylogenetic signal. The incongruity
31
+ in support between amino-acid and nucleotide analyses of the forementioned arthropod
32
+ data set is resolved by showing that "standard" 20-amino-acid analyses yield lower
33
+ node support specifically when serine provides crucial signal. Separate coding
34
+ of Ser1 and Ser2 residues yields support commensurate with that found by degenerated
35
+ nucleotides, without introducing phylogenetic artifacts. While exclusion of all
36
+ serine data leads to reduced support for serine-sensitive nodes, these nodes are
37
+ still recovered in the ML topology, indicating that the enhanced signal from Ser1
38
+ and Ser2 is not qualitatively different from that of the other amino acids.
39
+ - 'Recent molecular phylogenetic studies of the insect order Lepidoptera have robustly
40
+ resolved family-level divergences within most superfamilies, and most divergences
41
+ among the relatively species-poor early-arising superfamilies. In sharp contrast,
42
+ relationships among the superfamilies of more advanced moths and butterflies that
43
+ comprise the mega-diverse clade Apoditrysia (ca. 145,000 spp.) remain mostly poorly
44
+ supported. This uncertainty, in turn, limits our ability to discern the origins,
45
+ ages and evolutionary consequences of traits hypothesized to promote the spectacular
46
+ diversification of Apoditrysia. Low support along the apoditrysian "backbone"
47
+ probably reflects rapid diversification. If so, it may be feasible to strengthen
48
+ resolution by radically increasing the gene sample, but case studies have been
49
+ few. We explored the potential of next-generation sequencing to conclusively resolve
50
+ apoditrysian relationships. We used transcriptome RNA-Seq to generate 1579 putatively
51
+ orthologous gene sequences across a broad sample of 40 apoditrysians plus four
52
+ outgroups, to which we added two taxa from previously published data. Phylogenetic
53
+ analysis of a 46-taxon, 741-gene matrix, resulting from a strict filter that eliminated
54
+ ortholog groups containing any apparent paralogs, yielded dramatic overall increase
55
+ in bootstrap support for deeper nodes within Apoditrysia as compared to results
56
+ from previous and concurrent 19-gene analyses. High support was restricted mainly
57
+ to the huge subclade Obtectomera broadly defined, in which 11 of 12 nodes subtending
58
+ multiple superfamilies had bootstrap support of 100%. The strongly supported nodes
59
+ showed little conflict with groupings from previous studies, and were little affected
60
+ by changes in taxon sampling, suggesting that they reflect true signal rather
61
+ than artifacts of massive gene sampling. In contrast, strong support was seen
62
+ at only 2 of 11 deeper nodes among the "lower", non-obtectomeran apoditrysians.
63
+ These represent a much harder phylogenetic problem, for which one path to resolution
64
+ might include further increase in gene sampling, together with improved orthology
65
+ assignments. '
66
+ - 'One of the major challenges in cell implantation therapies is to promote integration
67
+ of the microcirculation between the implanted cells and the host. We used adipose-derived
68
+ stromal vascular fraction (SVF) cells to vascularize a human liver cell (HepG2)
69
+ implant. We hypothesized that the SVF cells would form a functional microcirculation
70
+ via vascular assembly and inosculation with the host vasculature. Initially, we
71
+ assessed the extent and character of neovasculatures formed by freshly isolated
72
+ and cultured SVF cells and found that freshly isolated cells have a higher vascularization
73
+ potential. Generation of a 3D implant containing fresh SVF and HepG2 cells formed
74
+ a tissue in which HepG2 cells were entwined with a network of microvessels. Implanted
75
+ HepG2 cells sequestered labeled LDL delivered by systemic intravascular injection
76
+ only in SVF-vascularized implants demonstrating that SVF cell-derived vasculatures
77
+ can effectively integrate with host vessels and interface with parenchymal cells
78
+ to form a functional tissue mimic. '
79
+ - source_sentence: Exosomes as drug delivery systems for gastrointestinal cancers
80
+ sentences:
81
+ - Gastrointestinal cancer is one of the most common malignancies with relatively
82
+ high morbidity and mortality. Exosomes are nanosized extracellular vesicles derived
83
+ from most cells and widely distributed in body fluids. They are natural endogenous
84
+ nanocarriers with low immunogenicity, high biocompatibility, and natural targeting,
85
+ and can transport lipids, proteins, DNA, and RNA. Exosomes contain DNA, RNA, proteins,
86
+ lipids, and other bioactive components, which can play a role in information transmission
87
+ and regulation of cellular physiological and pathological processes during the
88
+ progression of gastrointestinal cancer. In this paper, the role of exosomes in
89
+ gastrointestinal cancers is briefly reviewed, with emphasis on the application
90
+ of exosomes as drug delivery systems for gastrointestinal cancers. Finally, the
91
+ challenges faced by exosome-based drug delivery systems are discussed.
92
+ - Background In the myocardium, pericytes are often confused with other interstitial
93
+ cell types, such as fibroblasts. The lack of well-characterized and specific tools
94
+ for identification, lineage tracing, and conditional targeting of myocardial pericytes
95
+ has hampered studies on their role in heart disease. In the current study, we
96
+ characterize and validate specific and reliable strategies for labeling and targeting
97
+ of cardiac pericytes. Methods and Results Using the neuron-glial antigen 2 (NG2)
98
+ - Exosomes are small extracellular vesicles with diameters of 30-150 nm. In both
99
+ physiological and pathological conditions, nearly all types of cells can release
100
+ exosomes, which play important roles in cell communication and epigenetic regulation
101
+ by transporting crucial protein and genetic materials such as miRNA, mRNA, and
102
+ DNA. Consequently, exosome-based disease diagnosis and therapeutic methods have
103
+ been intensively investigated. However, as in any natural science field, the in-depth
104
+ investigation of exosomes relies heavily on technological advances. Historically,
105
+ the two main technical hindrances that have restricted the basic and applied researches
106
+ of exosomes include, first, how to simplify the extraction and improve the yield
107
+ of exosomes and, second, how to effectively distinguish exosomes from other extracellular
108
+ vesicles, especially functional microvesicles. Over the past few decades, although
109
+ a standardized exosome isolation method has still not become available, a number
110
+ of techniques have been established through exploration of the biochemical and
111
+ physicochemical features of exosomes. In this work, by comprehensively analyzing
112
+ the progresses in exosome separation strategies, we provide a panoramic view of
113
+ current exosome isolation techniques, providing perspectives toward the development
114
+ of novel approaches for high-efficient exosome isolation from various types of
115
+ biological matrices. In addition, from the perspective of exosome-based diagnosis
116
+ and therapeutics, we emphasize the issue of quantitative exosome and microvesicle
117
+ separation.
118
+ - source_sentence: Comparison of pesticide active substances in conventional agriculture
119
+ and organic agriculture in Europe
120
+ sentences:
121
+ - Total concentrations of metals in soil are poor predictors of toxicity. In the
122
+ last decade, considerable effort has been made to demonstrate how metal toxicity
123
+ is affected by the abiotic properties of soil. Here this information is collated
124
+ and shows how these data have been used in the European Union for defining predicted-no-effect
125
+ concentrations (PNECs) of Cd, Cu, Co, Ni, Pb, and Zn in soil. Bioavailability
126
+ models have been calibrated using data from more than 500 new chronic toxicity
127
+ tests in soils amended with soluble metal salts, in experimentally aged soils,
128
+ and in field-contaminated soils. In general, soil pH was a good predictor of metal
129
+ solubility but a poor predictor of metal toxicity across soils. Toxicity thresholds
130
+ based on the free metal ion activity were generally more variable than those expressed
131
+ on total soil metal, which can be explained, but not predicted, using the concept
132
+ of the biotic ligand model. The toxicity thresholds based on total soil metal
133
+ concentrations rise almost proportionally to the effective cation exchange capacity
134
+ of soil. Total soil metal concentrations yielding 10% inhibition in freshly amended
135
+ soils were up to 100-fold smaller (median 3.4-fold, n = 110 comparative tests)
136
+ than those in corresponding aged soils or field-contaminated soils. The change
137
+ in isotopically exchangeable metal in soil proved to be a conservative estimate
138
+ of the change in toxicity upon aging. The PNEC values for specific soil types
139
+ were calculated using this information. The corrections for aging and for modifying
140
+ effects of soil properties in metal-salt-amended soils are shown to be the main
141
+ factors by which PNEC values rise above the natural background range.
142
+ - There is much debate about whether the (mostly synthetic) pesticide active substances
143
+ (AS) in conventional agriculture have different non-target effects than the natural
144
+ AS in organic agriculture. We evaluated the official EU pesticide database to
145
+ compare 256 AS that may only be used on conventional farmland with 134 AS that
146
+ are permitted on organic farmland. As a benchmark, we used (i) the hazard classifications
147
+ of the Globally Harmonized System (GHS), and (ii) the dietary and occupational
148
+ health-based guidance values, which were established in the authorization procedure.
149
+ Our comparison showed that 55% of the AS used only in conventional agriculture
150
+ contained health or environmental hazard statements, but only 3% did of the AS
151
+ authorized for organic agriculture. Warnings about possible harm to the unborn
152
+ child, suspected carcinogenicity, or acute lethal effects were found in 16% of
153
+ the AS used in conventional agriculture, but none were found in organic agriculture.
154
+ Furthermore, the establishment of health-based guidance values for dietary and
155
+ non-dietary exposures were relevant by the European authorities for 93% of conventional
156
+ AS, but only for 7% of organic AS. We, therefore, encourage policies and strategies
157
+ to reduce the use and risk of pesticides, and to strengthen organic farming in
158
+ order to protect biodiversity and maintain food security.
159
+ - Herpes simplex virus 1 (HSV-1) encodes Us3 protein kinase, which is critical for
160
+ viral pathogenicity in both mouse peripheral sites (e.g., eyes and vaginas) and
161
+ in the central nervous systems (CNS) of mice after intracranial and peripheral
162
+ inoculations, respectively. Whereas some Us3 substrates involved in Us3 pathogenicity
163
+ in peripheral sites have been reported, those involved in Us3 pathogenicity in
164
+ the CNS remain to be identified. We recently reported that Us3 phosphorylated
165
+ HSV-1 dUTPase (vdUTPase) at serine 187 (Ser-187) in infected cells, and this phosphorylation
166
+ promoted viral replication by regulating optimal enzymatic activity of vdUTPase.
167
+ In the present study, we show that the replacement of vdUTPase Ser-187 by alanine
168
+ (S187A) significantly reduced viral replication and virulence in the CNS of mice
169
+ following intracranial inoculation and that the phosphomimetic substitution at
170
+ vdUTPase Ser-187 in part restored the wild-type viral replication and virulence.
171
+ Interestingly, the S187A mutation in vdUTPase had no effect on viral replication
172
+ and pathogenic effects in the eyes and vaginas of mice after ocular and vaginal
173
+ inoculation, respectively. Similarly, the enzyme-dead mutation in vdUTPase significantly
174
+ reduced viral replication and virulence in the CNS of mice after intracranial
175
+ inoculation, whereas the mutation had no effect on viral replication and pathogenic
176
+ effects in the eyes and vaginas of mice after ocular and vaginal inoculation,
177
+ respectively. These observations suggested that vdUTPase was one of the Us3 substrates
178
+ responsible for Us3 pathogenicity in the CNS and that the CNS-specific virulence
179
+ of HSV-1 involved strict regulation of vdUTPase activity by Us3 phosphorylation.
180
+ - source_sentence: Load-dependent detachment and reattachment kinetics of kinesin-1,
181
+ -2 and 3 motors
182
+ sentences:
183
+ - Bidirectional cargo transport by kinesin and dynein is essential for cell viability
184
+ and defects are linked to neurodegenerative diseases. Computational modeling suggests
185
+ that the load-dependent off-rate is the strongest determinant of which motor 'wins'
186
+ a kinesin-dynein tug-of-war, and optical tweezer experiments find that the load-dependent
187
+ detachment sensitivity of transport kinesins is kinesin-3 > kinesin-2 > kinesin-1.
188
+ However, in reconstituted kinesin-dynein pairs in vitro, all three kinesin families
189
+ compete nearly equally well against dynein. Modeling and experiments have confirmed
190
+ that vertical forces inherent to the large trapping beads enhance kinesin-1 dissociation
191
+ rates. In vivo, vertical forces are expected to range from negligible to dominant,
192
+ depending on cargo and microtubule geometries. To investigate the detachment and
193
+ reattachment kinetics of kinesin-1, 2 and 3 motors against loads oriented parallel
194
+ to the microtubule, we created a DNA tensiometer comprising a DNA entropic spring
195
+ attached to the microtubule on one end and a motor on the other. Kinesin dissociation
196
+ rates at stall were slower than detachment rates during unloaded runs, and the
197
+ complex reattachment kinetics were consistent with a weakly-bound 'slip' state
198
+ preceding detachment. Kinesin-3 behaviors under load suggested that long KIF1A
199
+ run lengths result from the concatenation of multiple short runs connected by
200
+ diffusive episodes. Stochastic simulations were able to recapitulate the load-dependent
201
+ detachment and reattachment kinetics for all three motors and provide direct comparison
202
+ of key transition rates between families. These results provide insight into how
203
+ kinesin-1, -2 and -3 families transport cargo in complex cellular geometries and
204
+ compete against dynein during bidirectional transport.
205
+ - 'AP-1 and AP-2 adaptor protein (AP) complexes mediate clathrin-dependent trafficking
206
+ at the trans-Golgi network (TGN) and the plasma membrane, respectively. Whereas
207
+ AP-1 is required for trafficking to plasma membrane and vacuoles, AP-2 mediates
208
+ endocytosis. These AP complexes consist of four subunits (adaptins): two large
209
+ subunits (β1 and γ for AP-1 and β2 and α for AP-2), a medium subunit μ, and a
210
+ small subunit σ. In general, adaptins are unique to each AP complex, with the
211
+ exception of β subunits that are shared by AP-1 and AP-2 in some invertebrates.
212
+ Here, we show that the two putative Arabidopsis thaliana AP1/2β adaptins co-assemble
213
+ with both AP-1 and AP-2 subunits and regulate exocytosis and endocytosis in root
214
+ cells, consistent with their dual localization at the TGN and plasma membrane.
215
+ Deletion of both β adaptins is lethal in plants. We identified a critical role
216
+ of β adaptins in pollen wall formation and reproduction, involving the regulation
217
+ of membrane trafficking in the tapetum and pollen germination. In tapetal cells,
218
+ β adaptins localize almost exclusively to the TGN and mediate exocytosis of the
219
+ plasma membrane transporters such as ATP-binding cassette (ABC)G9 and ABCG16.
220
+ This study highlights the essential role of AP1/2β adaptins in plants and their
221
+ specialized roles in specific cell types.'
222
+ - A single kinesin molecule can move "processively" along a microtubule for more
223
+ than 1 micrometer before detaching from it. The prevailing explanation for this
224
+ processive movement is the "walking model," which envisions that each of two motor
225
+ domains (heads) of the kinesin molecule binds coordinately to the microtubule.
226
+ This implies that each kinesin molecule must have two heads to "walk" and that
227
+ a single-headed kinesin could not move processively. Here, a motor-domain construct
228
+ of KIF1A, a single-headed kinesin superfamily protein, was shown to move processively
229
+ along the microtubule for more than 1 micrometer. The movement along the microtubules
230
+ was stochastic and fitted a biased Brownian-movement model.
231
+ - source_sentence: Phylogenetic analysis of mitochondrial genes in Macquarie perch
232
+ from three river basins
233
+ sentences:
234
+ - Sedentary behavior is an emerging risk factor for cardiovascular disease (CVD)
235
+ and may be particularly relevant to the cardiovascular health of older adults.
236
+ This scoping review describes the existing literature examining the prevalence
237
+ of sedentary time in older adults with CVD and the association of sedentary behavior
238
+ with cardiovascular risk in older adults. We found that older adults with CVD
239
+ spend >75 % of their waking day sedentary, and that sedentary time is higher among
240
+ older adults with CVD than among older adults without CVD. High sedentary behavior
241
+ is consistently associated with worse cardiac lipid profiles and increased cardiac
242
+ risk scores in older adults; the associations of sedentary behavior with blood
243
+ pressure, CVD incidence, and CVD-related mortality among older adults are less
244
+ clear. Future research with larger sample sizes using validated methods to measure
245
+ sedentary behavior are needed to clarify the association between sedentary behavior
246
+ and cardiovascular outcomes in older adults.
247
+ - An improved Bayesian method is presented for estimating phylogenetic trees using
248
+ DNA sequence data. The birth-death process with species sampling is used to specify
249
+ the prior distribution of phylogenies and ancestral speciation times, and the
250
+ posterior probabilities of phylogenies are used to estimate the maximum posterior
251
+ probability (MAP) tree. Monte Carlo integration is used to integrate over the
252
+ ancestral speciation times for particular trees. A Markov Chain Monte Carlo method
253
+ is used to generate the set of trees with the highest posterior probabilities.
254
+ Methods are described for an empirical Bayesian analysis, in which estimates of
255
+ the speciation and extinction rates are used in calculating the posterior probabilities,
256
+ and a hierarchical Bayesian analysis, in which these parameters are removed from
257
+ the model by an additional integration. The Markov Chain Monte Carlo method avoids
258
+ the requirement of our earlier method for calculating MAP trees to sum over all
259
+ possible topologies (which limited the number of taxa in an analysis to about
260
+ five). The methods are applied to analyze DNA sequences for nine species of primates,
261
+ and the MAP tree, which is identical to a maximum-likelihood estimate of topology,
262
+ has a probability of approximately 95%.
263
+ - 'Genetic variation in mitochondrial genes could underlie metabolic adaptations
264
+ because mitochondrially encoded proteins are directly involved in a pathway supplying
265
+ energy to metabolism. Macquarie perch from river basins exposed to different climates
266
+ differ in size and growth rate, suggesting potential presence of adaptive metabolic
267
+ differences. We used complete mitochondrial genome sequences to build a phylogeny,
268
+ estimate lineage divergence times and identify signatures of purifying and positive
269
+ selection acting on mitochondrial genes for 25 Macquarie perch from three basins:
270
+ Murray-Darling Basin (MDB), Hawkesbury-Nepean Basin (HNB) and Shoalhaven Basin
271
+ (SB). Phylogenetic analysis resolved basin-level clades, supporting incipient
272
+ speciation previously inferred from differentiation in allozymes, microsatellites
273
+ and mitochondrial control region. The estimated time of lineage divergence suggested
274
+ an early- to mid-Pleistocene split between SB and the common ancestor of HNB+MDB,
275
+ followed by mid-to-late Pleistocene splitting between HNB and MDB. These divergence
276
+ estimates are more recent than previous ones. Our analyses suggested that evolutionary
277
+ drivers differed between inland MDB and coastal HNB. In the cooler and more climatically
278
+ variable MDB, mitogenomes evolved under strong purifying selection, whereas in
279
+ the warmer and more climatically stable HNB, purifying selection was relaxed.
280
+ Evidence for relaxed selection in the HNB includes elevated transfer RNA and 16S
281
+ ribosomal RNA polymorphism, presence of potentially mildly deleterious mutations
282
+ and a codon (ATP6'
283
+ pipeline_tag: sentence-similarity
284
+ library_name: sentence-transformers
285
+ ---
286
+
287
+ # SentenceTransformer based on thenlper/gte-base
288
+
289
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [thenlper/gte-base](https://huggingface.co/thenlper/gte-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
290
+
291
+ ## Model Details
292
+
293
+ ### Model Description
294
+ - **Model Type:** Sentence Transformer
295
+ - **Base model:** [thenlper/gte-base](https://huggingface.co/thenlper/gte-base) <!-- at revision c078288308d8dee004ab72c6191778064285ec0c -->
296
+ - **Maximum Sequence Length:** 512 tokens
297
+ - **Output Dimensionality:** 768 dimensions
298
+ - **Similarity Function:** Cosine Similarity
299
+ <!-- - **Training Dataset:** Unknown -->
300
+ <!-- - **Language:** Unknown -->
301
+ <!-- - **License:** Unknown -->
302
+
303
+ ### Model Sources
304
+
305
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
306
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
307
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
308
+
309
+ ### Full Model Architecture
310
+
311
+ ```
312
+ SentenceTransformer(
313
+ (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
314
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
315
+ (2): Normalize()
316
+ )
317
+ ```
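
The architecture listing above shows mean pooling (`pooling_mode_mean_tokens: True`) followed by `Normalize()`. As a rough illustration of what those two stages compute, the sketch below averages token vectors over non-padding positions and then scales the result to unit length. It is a plain-Python approximation on made-up toy inputs, not the library's implementation; `mean_pool` and `l2_normalize` are hypothetical helper names introduced here.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in summed]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


# Toy input (assumption): three 2-dimensional token vectors, the last one is padding.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_vec = l2_normalize(mean_pool(tokens, mask))
print(sentence_vec)  # [0.6, 0.8]
```

The padded position does not influence the result, and the output has length 1, which is why the cosine similarity listed under Model Description reduces to a dot product for this model's embeddings.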
318
+
319
+ ## Usage
320
+
321
+ ### Direct Usage (Sentence Transformers)
322
+
323
+ First install the Sentence Transformers library:
324
+
325
+ ```bash
326
+ pip install -U sentence-transformers
327
+ ```
328
+
329
+ Then you can load this model and run inference.
330
+ ```python
331
+ from sentence_transformers import SentenceTransformer
332
+
333
+ # Download from the 🤗 Hub
334
+ model = SentenceTransformer("sentence_transformers_model_id")
335
+ # Run inference
336
+ sentences = [
337
+ 'Phylogenetic analysis of mitochondrial genes in Macquarie perch from three river basins',
338
+ 'Genetic variation in mitochondrial genes could underlie metabolic adaptations because mitochondrially encoded proteins are directly involved in a pathway supplying energy to metabolism. Macquarie perch from river basins exposed to different climates differ in size and growth rate, suggesting potential presence of adaptive metabolic differences. We used complete mitochondrial genome sequences to build a phylogeny, estimate lineage divergence times and identify signatures of purifying and positive selection acting on mitochondrial genes for 25 Macquarie perch from three basins: Murray-Darling Basin (MDB), Hawkesbury-Nepean Basin (HNB) and Shoalhaven Basin (SB). Phylogenetic analysis resolved basin-level clades, supporting incipient speciation previously inferred from differentiation in allozymes, microsatellites and mitochondrial control region. The estimated time of lineage divergence suggested an early- to mid-Pleistocene split between SB and the common ancestor of HNB+MDB, followed by mid-to-late Pleistocene splitting between HNB and MDB. These divergence estimates are more recent than previous ones. Our analyses suggested that evolutionary drivers differed between inland MDB and coastal HNB. In the cooler and more climatically variable MDB, mitogenomes evolved under strong purifying selection, whereas in the warmer and more climatically stable HNB, purifying selection was relaxed. Evidence for relaxed selection in the HNB includes elevated transfer RNA and 16S ribosomal RNA polymorphism, presence of potentially mildly deleterious mutations and a codon (ATP6',
339
+ 'An improved Bayesian method is presented for estimating phylogenetic trees using DNA sequence data. The birth-death process with species sampling is used to specify the prior distribution of phylogenies and ancestral speciation times, and the posterior probabilities of phylogenies are used to estimate the maximum posterior probability (MAP) tree. Monte Carlo integration is used to integrate over the ancestral speciation times for particular trees. A Markov Chain Monte Carlo method is used to generate the set of trees with the highest posterior probabilities. Methods are described for an empirical Bayesian analysis, in which estimates of the speciation and extinction rates are used in calculating the posterior probabilities, and a hierarchical Bayesian analysis, in which these parameters are removed from the model by an additional integration. The Markov Chain Monte Carlo method avoids the requirement of our earlier method for calculating MAP trees to sum over all possible topologies (which limited the number of taxa in an analysis to about five). The methods are applied to analyze DNA sequences for nine species of primates, and the MAP tree, which is identical to a maximum-likelihood estimate of topology, has a probability of approximately 95%.',
340
+ ]
341
+ embeddings = model.encode(sentences)
342
+ print(embeddings.shape)
343
+ # (3, 768)
344
+
345
+ # Get the similarity scores for the embeddings
346
+ similarities = model.similarity(embeddings, embeddings)
347
+ print(similarities)
348
+ # tensor([[1.0000, 0.9449, 0.8056],
349
+ # [0.9449, 1.0000, 0.7868],
350
+ # [0.8056, 0.7868, 1.0000]])
351
+ ```
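
The usage snippet above produces a full similarity matrix. For semantic search, a common pattern is to score one query embedding against several candidate embeddings and sort by score. The sketch below shows that ranking step in plain Python on small made-up vectors, so it runs without downloading a model; `cosine_similarity` and `rank_candidates` are hypothetical helpers, and it assumes no vector is all zeros. In practice the vectors would come from `model.encode(...)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidate_vecs):
    """Return (candidate_index, score) pairs, best match first."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy embeddings (assumption): stand-ins for a title and three abstracts.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_candidates(query, candidates)
for index, score in ranking:
    print(f"candidate {index}: {score:.4f}")
```

With real embeddings from this model, the first row of the matrix printed above plays the same role: the matching abstract scores higher against the title than the unrelated one does.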
352
+
353
+ <!--
354
+ ### Direct Usage (Transformers)
355
+
356
+ <details><summary>Click to see the direct usage in Transformers</summary>
357
+
358
+ </details>
359
+ -->
360
+
361
+ <!--
362
+ ### Downstream Usage (Sentence Transformers)
363
+
364
+ You can finetune this model on your own dataset.
365
+
366
+ <details><summary>Click to expand</summary>
367
+
368
+ </details>
369
+ -->
370
+
371
+ <!--
372
+ ### Out-of-Scope Use
373
+
374
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
375
+ -->
376
+
377
+ <!--
378
+ ## Bias, Risks and Limitations
379
+
380
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
381
+ -->
382
+
383
+ <!--
384
+ ### Recommendations
385
+
386
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
387
+ -->
388
+
389
+ ## Training Details
390
+
391
+ ### Training Dataset
392
+
393
+ #### Unnamed Dataset
394
+
395
+ * Size: 95,253 training samples
396
+ * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>sentence_2</code>
397
+ * Approximate statistics based on the first 1000 samples:
398
+ | | sentence_0 | sentence_1 | sentence_2 |
399
+ |:--------|:----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
400
+ | type | string | string | string |
401
+ | details | <ul><li>min: 6 tokens</li><li>mean: 19.51 tokens</li><li>max: 56 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 223.97 tokens</li><li>max: 512 tokens</li></ul> | <ul><li>min: 51 tokens</li><li>mean: 309.24 tokens</li><li>max: 512 tokens</li></ul> |
402
+ * Samples:
403
+ | sentence_0 | sentence_1 | sentence_2 |
404
+ |:----------------------------------------------------------------------------|:---|:---|
405
+ | <code>Sox5 modulates the activity of Sox10 in the melanocyte lineage</code> | <code>The transcription factor Sox5 has previously been shown in chicken to be expressed in early neural crest cells and neural crest-derived peripheral glia. Here, we show in mouse that Sox5 expression also continues after neural crest specification in the melanocyte lineage. Despite its continued expression, Sox5 has little impact on melanocyte development on its own as generation of melanoblasts and melanocytes is unaltered in Sox5-deficient mice. Loss of Sox5, however, partially rescued the strongly reduced melanoblast generation and marker gene expression in Sox10 heterozygous mice arguing that Sox5 functions in the melanocyte lineage by modulating Sox10 activity. This modulatory activity involved Sox5 binding and recruitment of CtBP2 and HDAC1 to the regulatory regions of melanocytic Sox10 target genes and direct inhibition of Sox10-dependent promoter activation. Both binding site competition and recruitment of corepressors thus help Sox5 to modulate the activity of Sox10 in the melano...</code> | <code>Transcripts for a new form of Sox5, called L-Sox5, and Sox6 are coexpressed with Sox9 in all chondrogenic sites of mouse embryos. A coiled-coil domain located in the N-terminal part of L-Sox5, and absent in Sox5, showed >90% identity with a similar domain in Sox6 and mediated homodimerization and heterodimerization with Sox6. Dimerization of L-Sox5/Sox6 greatly increased efficiency of binding of the two Sox proteins to DNA containing adjacent HMG sites. L-Sox5, Sox6 and Sox9 cooperatively activated expression of the chondrocyte differentiation marker Col2a1 in 10T1/2 and MC615 cells. A 48 bp chondrocyte-specific enhancer in this gene, which contains several HMG-like sites that are necessary for enhancer activity, bound the three Sox proteins and was cooperatively activated by the three Sox proteins in non-chondrogenic cells. Our data suggest that L-Sox5/Sox6 and Sox9, which belong to two different classes of Sox transcription factors, cooperate with each other in expression of Col2a1 a...</code> |
406
+ | <code>are asgard archaea related to eukaryotes</code> | <code>Asgard archaea are considered to be the closest known relatives of eukaryotes. Their genomes contain hundreds of eukaryotic signature proteins (ESPs), which inspired hypotheses on the evolution of the eukaryotic cell</code> | <code>Eukaryotes evolved from a symbiosis involving alphaproteobacteria and archaea phylogenetically nested within the Asgard clade. Two recent studies explore the metabolic capabilities of Asgard lineages, supporting refined symbiotic metabolic interactions that might have operated at the dawn of eukaryogenesis.</code> |
407
+ | <code>Fanconi Anemia in Pediatric Medulloblastoma and Fanconi Anemia</code> | <code>The outcome of children with medulloblastoma (MB) and Fanconi Anemia (FA), an inherited DNA repair deficiency, has not been described systematically. Treatment is complicated by high vulnerability to treatment-associated side effects, yet structured data are lacking. This study aims to give a comprehensive overview of clinical and molecular characteristics of pediatric FA MB patients.</code> | <code>The Sonic Hedgehog (SHH) signaling pathway is indispensable for development, and functions to activate a transcriptional program modulated by the GLI transcription factors. Here, we report that loss of a regulator of the SHH pathway, Suppressor of Fused (Sufu), resulted in early embryonic lethality in the mouse similar to inactivation of another SHH regulator, Patched1 (Ptch1). In contrast to Ptch1+/- mice, Sufu+/- mice were not tumor prone. However, in conjunction with p53 loss, Sufu+/- animals developed tumors including medulloblastoma and rhabdomyosarcoma. Tumors present in Sufu+/-p53-/- animals resulted from Sufu loss of heterozygosity. Sufu+/-p53-/- medulloblastomas also expressed a signature gene expression profile typical of aberrant SHH signaling, including upregulation of N-myc, Sfrp1, Ptch2 and cyclin D1. Finally, the Smoothened inhibitor, hedgehog antagonist, did not block growth of tumors arising from Sufu inactivation. These data demonstrate that Sufu is essential for deve...</code> |
408
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
409
+ ```json
410
+ {
411
+ "scale": 20.0,
412
+ "similarity_fct": "cos_sim"
413
+ }
414
+ ```
415
+
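The loss parameters above (`scale` of 20.0, cosine similarity) can be illustrated with a small pure-Python sketch of an in-batch-negatives ranking loss. This is not the library's implementation. It assumes that `sentence_0` is the anchor, `sentence_1` the positive, and `sentence_2` an extra hard negative; the card only names the columns, so that role assignment is an assumption. The names `cos_sim` and `mnrl_loss` are hypothetical helpers.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnrl_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i must rank positive i first.

    Every other positive and every negative in the batch is treated as a
    wrong candidate (assumed roles: see the paragraph above).
    """
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy 2-D "embeddings", made up for illustration only.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.1, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

aligned = mnrl_loss(anchors, positives, negatives)
swapped = mnrl_loss(anchors, list(reversed(positives)), negatives)
```

With the positives aligned to their anchors the loss is close to zero; swapping them makes it much larger, which is the ranking signal the scale factor sharpens.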
416
+ ### Training Hyperparameters
417
+ #### Non-Default Hyperparameters
418
+
419
+ - `per_device_train_batch_size`: 16
420
+ - `per_device_eval_batch_size`: 16
421
+ - `num_train_epochs`: 1
422
+ - `max_steps`: 20
423
+ - `multi_dataset_batch_sampler`: round_robin
424
+
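In the Transformers `Trainer`, a positive `max_steps` overrides `num_train_epochs`, so the settings above imply a very short run. The sketch below works out how many training samples that covers. It assumes a single device and `gradient_accumulation_steps` of 1; the device count is not stated in the card, and `samples_seen` is a hypothetical helper.

```python
def samples_seen(max_steps, per_device_batch_size, grad_accum_steps=1, num_devices=1):
    """Upper bound on training samples consumed in a step-limited run."""
    return max_steps * per_device_batch_size * grad_accum_steps * num_devices

dataset_size = 95253          # from the Training Dataset section
seen = samples_seen(max_steps=20, per_device_batch_size=16)
fraction = seen / dataset_size

print(f"{seen} samples, {fraction:.2%} of the dataset")
```

Under these assumptions the run sees 320 samples, well under 1% of the 95,253 available.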
425
+ #### All Hyperparameters
426
+ <details><summary>Click to expand</summary>
427
+
428
+ - `overwrite_output_dir`: False
429
+ - `do_predict`: False
430
+ - `eval_strategy`: no
431
+ - `prediction_loss_only`: True
432
+ - `per_device_train_batch_size`: 16
433
+ - `per_device_eval_batch_size`: 16
434
+ - `per_gpu_train_batch_size`: None
435
+ - `per_gpu_eval_batch_size`: None
436
+ - `gradient_accumulation_steps`: 1
437
+ - `eval_accumulation_steps`: None
438
+ - `torch_empty_cache_steps`: None
439
+ - `learning_rate`: 5e-05
440
+ - `weight_decay`: 0.0
441
+ - `adam_beta1`: 0.9
442
+ - `adam_beta2`: 0.999
443
+ - `adam_epsilon`: 1e-08
444
+ - `max_grad_norm`: 1
445
+ - `num_train_epochs`: 1
446
+ - `max_steps`: 20
447
+ - `lr_scheduler_type`: linear
448
+ - `lr_scheduler_kwargs`: {}
449
+ - `warmup_ratio`: 0.0
450
+ - `warmup_steps`: 0
451
+ - `log_level`: passive
452
+ - `log_level_replica`: warning
453
+ - `log_on_each_node`: True
454
+ - `logging_nan_inf_filter`: True
455
+ - `save_safetensors`: True
456
+ - `save_on_each_node`: False
457
+ - `save_only_model`: False
458
+ - `restore_callback_states_from_checkpoint`: False
459
+ - `no_cuda`: False
460
+ - `use_cpu`: False
461
+ - `use_mps_device`: False
462
+ - `seed`: 42
463
+ - `data_seed`: None
464
+ - `jit_mode_eval`: False
465
+ - `use_ipex`: False
466
+ - `bf16`: False
467
+ - `fp16`: False
468
+ - `fp16_opt_level`: O1
469
+ - `half_precision_backend`: auto
470
+ - `bf16_full_eval`: False
471
+ - `fp16_full_eval`: False
472
+ - `tf32`: None
473
+ - `local_rank`: 0
474
+ - `ddp_backend`: None
475
+ - `tpu_num_cores`: None
476
+ - `tpu_metrics_debug`: False
477
+ - `debug`: []
478
+ - `dataloader_drop_last`: False
479
+ - `dataloader_num_workers`: 0
480
+ - `dataloader_prefetch_factor`: None
481
+ - `past_index`: -1
482
+ - `disable_tqdm`: False
483
+ - `remove_unused_columns`: True
484
+ - `label_names`: None
485
+ - `load_best_model_at_end`: False
486
+ - `ignore_data_skip`: False
487
+ - `fsdp`: []
488
+ - `fsdp_min_num_params`: 0
489
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
490
+ - `fsdp_transformer_layer_cls_to_wrap`: None
491
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
492
+ - `deepspeed`: None
493
+ - `label_smoothing_factor`: 0.0
494
+ - `optim`: adamw_torch
495
+ - `optim_args`: None
496
+ - `adafactor`: False
497
+ - `group_by_length`: False
498
+ - `length_column_name`: length
499
+ - `ddp_find_unused_parameters`: None
500
+ - `ddp_bucket_cap_mb`: None
501
+ - `ddp_broadcast_buffers`: False
502
+ - `dataloader_pin_memory`: True
503
+ - `dataloader_persistent_workers`: False
504
+ - `skip_memory_metrics`: True
505
+ - `use_legacy_prediction_loop`: False
506
+ - `push_to_hub`: False
507
+ - `resume_from_checkpoint`: None
508
+ - `hub_model_id`: None
509
+ - `hub_strategy`: every_save
510
+ - `hub_private_repo`: None
511
+ - `hub_always_push`: False
512
+ - `gradient_checkpointing`: False
513
+ - `gradient_checkpointing_kwargs`: None
514
+ - `include_inputs_for_metrics`: False
515
+ - `include_for_metrics`: []
516
+ - `eval_do_concat_batches`: True
517
+ - `fp16_backend`: auto
518
+ - `push_to_hub_model_id`: None
519
+ - `push_to_hub_organization`: None
520
+ - `mp_parameters`:
521
+ - `auto_find_batch_size`: False
522
+ - `full_determinism`: False
523
+ - `torchdynamo`: None
524
+ - `ray_scope`: last
525
+ - `ddp_timeout`: 1800
526
+ - `torch_compile`: False
527
+ - `torch_compile_backend`: None
528
+ - `torch_compile_mode`: None
529
+ - `include_tokens_per_second`: False
530
+ - `include_num_input_tokens_seen`: False
531
+ - `neftune_noise_alpha`: None
532
+ - `optim_target_modules`: None
533
+ - `batch_eval_metrics`: False
534
+ - `eval_on_start`: False
535
+ - `use_liger_kernel`: False
536
+ - `eval_use_gather_object`: False
537
+ - `average_tokens_across_devices`: False
538
+ - `prompts`: None
539
+ - `batch_sampler`: batch_sampler
540
+ - `multi_dataset_batch_sampler`: round_robin
541
+ - `router_mapping`: {}
542
+ - `learning_rate_mapping`: {}
543
+
544
+ </details>
545
+
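The combination of `lr_scheduler_type: linear`, `warmup_steps: 0`, `learning_rate: 5e-05`, and `max_steps: 20` describes a learning rate that decays linearly from the base value to zero. A minimal sketch of that schedule follows; `linear_lr` is a hypothetical helper written for illustration, not a library function.

```python
def linear_lr(step, base_lr=5e-05, total_steps=20, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate used at each of the 20 optimizer steps.
schedule = [linear_lr(s) for s in range(20)]
```

The first step uses the full 5e-05, the last step uses one twentieth of it, and the rate reaches zero only after the final update.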
546
+ ### Framework Versions
547
+ - Python: 3.10.14
548
+ - Sentence Transformers: 5.0.0
549
+ - Transformers: 4.52.4
550
+ - PyTorch: 2.6.0+cu124
551
+ - Accelerate: 1.6.0
552
+ - Datasets: 3.6.0
553
+ - Tokenizers: 0.21.1
554
+
555
+ ## Citation
556
+
557
+ ### BibTeX
558
+
559
+ #### Sentence Transformers
560
+ ```bibtex
561
+ @inproceedings{reimers-2019-sentence-bert,
562
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
563
+ author = "Reimers, Nils and Gurevych, Iryna",
564
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
565
+ month = "11",
566
+ year = "2019",
567
+ publisher = "Association for Computational Linguistics",
568
+ url = "https://arxiv.org/abs/1908.10084",
569
+ }
570
+ ```
571
+
572
+ #### MultipleNegativesRankingLoss
573
+ ```bibtex
574
+ @misc{henderson2017efficient,
575
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
576
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
577
+ year={2017},
578
+ eprint={1705.00652},
579
+ archivePrefix={arXiv},
580
+ primaryClass={cs.CL}
581
+ }
582
+ ```
583
+
584
+ <!--
585
+ ## Glossary
586
+
587
+ *Clearly define terms in order to be accessible across audiences.*
588
+ -->
589
+
590
+ <!--
591
+ ## Model Card Authors
592
+
593
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
594
+ -->
595
+
596
+ <!--
597
+ ## Model Card Contact
598
+
599
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
600
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "architectures": [
3
+ "BertModel"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "classifier_dropout": null,
7
+ "gradient_checkpointing": false,
8
+ "hidden_act": "gelu",
9
+ "hidden_dropout_prob": 0.1,
10
+ "hidden_size": 768,
11
+ "initializer_range": 0.02,
12
+ "intermediate_size": 3072,
13
+ "layer_norm_eps": 1e-12,
14
+ "max_position_embeddings": 512,
15
+ "model_type": "bert",
16
+ "num_attention_heads": 12,
17
+ "num_hidden_layers": 12,
18
+ "pad_token_id": 0,
19
+ "position_embedding_type": "absolute",
20
+ "torch_dtype": "float32",
21
+ "transformers_version": "4.52.4",
22
+ "type_vocab_size": 2,
23
+ "use_cache": true,
24
+ "vocab_size": 30522
25
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "model_type": "SentenceTransformer",
3
+ "__version__": {
4
+ "sentence_transformers": "5.0.0",
5
+ "transformers": "4.52.4",
6
+ "pytorch": "2.6.0+cu124"
7
+ },
8
+ "prompts": {
9
+ "query": "",
10
+ "document": ""
11
+ },
12
+ "default_prompt_name": null,
13
+ "similarity_fn_name": "cosine"
14
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4951e1b8dca548632eeeb2deab1ff2a45497fc6df610bf7cf0d205902e7a2186
3
+ size 437951328
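
As a sanity check, the `model.safetensors` size can be compared with a parameter count derived from `config.json`. The sketch below assumes the standard `BertModel` layout (embeddings, 12 encoder layers, and a pooler) stored as `float32`; whether the pooler is saved and what the remaining bytes contain are assumptions, not facts stated in the commit.

```python
# Values copied from the config.json shown above.
vocab_size, hidden, layers = 30522, 768, 12
intermediate, max_positions, type_vocab = 3072, 512, 2

def linear(n_in, n_out):
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # weight + bias

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
encoder_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)    # feed-forward up-projection
    + linear(intermediate, hidden)    # feed-forward down-projection
    + layer_norm
)
pooler = linear(hidden, hidden)       # assumption: the pooler is saved too

total_params = embeddings + layers * encoder_layer + pooler
weight_bytes = total_params * 4       # float32, per "torch_dtype"
file_size = 437951328                 # from the LFS pointer above
remainder = file_size - weight_bytes
```

Under these assumptions the count is 109,482,240 parameters (437,928,960 bytes), leaving 22,368 bytes, which would be consistent with a safetensors metadata header.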
modules.json ADDED
@@ -0,0 +1,20 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Normalize",
18
+ "type": "sentence_transformers.models.Normalize"
19
+ }
20
+ ]
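
Because the pipeline in `modules.json` ends with a `Normalize` module, output embeddings should have unit length, in which case the cosine similarity configured as `similarity_fn_name` reduces to a plain dot product. The sketch below checks that identity on toy vectors; the helper names are hypothetical and the vectors are made up.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a, b = [3.0, 4.0], [1.0, 2.0]
ua, ub = l2_normalize(a), l2_normalize(b)

# After normalization, the dot product equals the cosine of the raw vectors.
difference = abs(dot(ua, ub) - cosine(a, b))
```

This is why a vector index that only supports inner-product search can still be used for cosine ranking with normalized embeddings.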
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 512,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_lower_case": true,
47
+ "extra_special_tokens": {},
48
+ "mask_token": "[MASK]",
49
+ "max_length": 128,
50
+ "model_max_length": 512,
51
+ "pad_to_multiple_of": null,
52
+ "pad_token": "[PAD]",
53
+ "pad_token_type_id": 0,
54
+ "padding_side": "right",
55
+ "sep_token": "[SEP]",
56
+ "stride": 0,
57
+ "strip_accents": null,
58
+ "tokenize_chinese_chars": true,
59
+ "tokenizer_class": "BertTokenizer",
60
+ "truncation_side": "right",
61
+ "truncation_strategy": "longest_first",
62
+ "unk_token": "[UNK]"
63
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff