cuadron11 committed
Commit 20b690b · verified · 1 Parent(s): ecb323b

Add new CrossEncoder model

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,408 @@
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - cross-encoder
5
+ - reranker
6
+ - generated_from_trainer
7
+ - dataset_size:3200
8
+ - loss:CachedMultipleNegativesRankingLoss
9
+ base_model: jinaai/jina-reranker-v2-base-multilingual
10
+ pipeline_tag: text-ranking
11
+ library_name: sentence-transformers
12
+ metrics:
13
+ - map
14
+ - mrr@10
15
+ - ndcg@10
16
+ model-index:
17
+ - name: CrossEncoder based on jinaai/jina-reranker-v2-base-multilingual
18
+ results:
19
+ - task:
20
+ type: cross-encoder-reranking
21
+ name: Cross Encoder Reranking
22
+ dataset:
23
+ name: jina reranker v2 base multilingual contrastive parl 4 10ep
24
+ type: jina-reranker-v2-base-multilingual-contrastive-parl-4-10ep
25
+ metrics:
26
+ - type: map
27
+ value: 0.0194
28
+ name: Map
29
+ - type: mrr@10
30
+ value: 0.0194
31
+ name: Mrr@10
32
+ - type: ndcg@10
33
+ value: 0.0198
34
+ name: Ndcg@10
35
+ ---
36
+
37
+ # CrossEncoder based on jinaai/jina-reranker-v2-base-multilingual
38
+
39
+ This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model finetuned from [jinaai/jina-reranker-v2-base-multilingual](https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual) using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.
40
+
41
+ ## Model Details
42
+
43
+ ### Model Description
44
+ - **Model Type:** Cross Encoder
45
+ - **Base model:** [jinaai/jina-reranker-v2-base-multilingual](https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual) <!-- at revision 2f894e63642a95228da19cdd583cd2309983c867 -->
46
+ - **Maximum Sequence Length:** 1024 tokens
47
+ - **Number of Output Labels:** 1 label
48
+ <!-- - **Training Dataset:** Unknown -->
49
+ <!-- - **Language:** Unknown -->
50
+ <!-- - **License:** Unknown -->
51
+
52
+ ### Model Sources
53
+
54
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
55
+ - **Documentation:** [Cross Encoder Documentation](https://www.sbert.net/docs/cross_encoder/usage/usage.html)
56
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
57
+ - **Hugging Face:** [Cross Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=cross-encoder)
58
+
59
+ ## Usage
60
+
61
+ ### Direct Usage (Sentence Transformers)
62
+
63
+ First install the Sentence Transformers library:
64
+
65
+ ```bash
66
+ pip install -U sentence-transformers
67
+ ```
68
+
69
+ Then you can load this model and run inference.
70
+ ```python
71
+ from sentence_transformers import CrossEncoder
72
+
73
+ # Download from the 🤗 Hub
74
+ model = CrossEncoder("cuadron11/jina-reranker-v2-base-multilingual-contrastive-parl-4-10ep")
75
+ # Get scores for pairs of texts
76
+ pairs = [
77
+ ['Zer gertatu zen martxoaren 3an Euskal Autonomia Erkidegoan?', '[TOPIC: Honako ekimen hauek batera eztabaidatu eta behin betiko ebazpena hartzea: ]\n[UNZALU HERMOSA, (SV-ES)]:\nSekula. Gertatzen dena da uste dugula martxoaren 3ko jokaerak baduela zer hobetua. Eta hobetzeko abiapuntu bakarra gogoeta egitea da, aztertzea eta hasieratik aitortzea hutsegiteak egin zirela. Izan ere, nire lehenengo hitzaldian esan dudanez, triskantzak gertatu izanak pentsarazi behar liguke zerbaitek huts egin zuela egun hartako dispositiboa edo operazioa planifikatzean eta zuzentzean. Horixe sartu nahi dugu guk: eztabaida-elementuak, hobekuntzarako kritika-elementuak, eta UPyDrekin eta Alderdi Popularrarekin sinatu dugun zuzenketan hori esaten da, onar dadila gauzak hobetu egin daitezkeela. Izan ere, Iturrate jauna, zuk egin dizkiguzun galderei nik beste batzuekin erantzungo nieke. Posible da hutsegiteetatik ikastea eta herritarren segurtasuna hobetzea? Posible da? Edo, besterik gabe, "Ahal zen modu bakarrean jokatu dugu" esatera mugatu behar dugu? Posible da herritarrei kalte gutxiago eragitea horrelako istiluak gertatzen direnean? Horixe planteatu nahi dugu guk, beharrezkoa dela… Eta uste osoa dugunez hobetu daitekeela, eta uste osoa dugunez hobeto joka zitekeela, horregatik nahi dugu eta horregatik planteatzen dugu hutsegiteak aztertzea, gogoeta egitea, eta elementu zuzentzaileak martxan jartzea horrelako egoerarik berriro gerta ez dadin. Eta, begira, sailarekin batera dispositiboari babesa eman dioten bakarrak dira, hain justu, Ertzaintzaren jokaerak inoiz babesten ez dituztenak; lehen esan dudanez, Ertzaintzaren kontrako ekintzak ere gaitzetsi ez dituztenak. Eta horrek kezkatu egiten gaitu. Nik ez dakit zu, Iturrate jauna, eta sailburu andrea kezkatzen zaituzten; baina, (Date: 03.04.2014)'],
78
+ ['Zenbat denbora behar da Ertzaintzako promozio baten deialdia egiten denetik agenteak kalera irteten diren arte?', '[TOPIC: Interpelazioa, Javier Ruiz de Arbulo Cerio Euskal Talde Popularreko legebiltzarkideak Segurtasuneko sailburuari egina, Arabako Miñoien Atalari buruz]\n[SEGURTASUNEKO SAILBURUAK (BELTRÁN DE HEREDIA ARRONIZ), (EA-NV)]:\nhoriek aldatu egiten dira egun batetik bestera, unitate batetik bestera, kontuan hartuta zer bilakaera duten erretiroek, kontuan hartuta nola gertatzen diren baja horiek… Baina, batez ere, nik bezain ondo dakizu Ertzaintzan defizit handia daukagula, eta ezin hobeto dakizu zergatia zein den. Ez dakit defizit horren zergatia zein den errepika diezazudan etorri zaren hona, baina ez daukat inolako eragozpenik Legebiltzar honetan berriro azaltzeko eta zuek berriro entzun behar izateko. Honela gaude Espainiako Gobernuak, Alderdi Popularraren Gobernuak, denbora asko behar izan zuelako, denbora gehiegi, zuk behar izan duzun be- zala, ulertzeko premia geneukala Ertzaintzan gertatzen ari ziren erretiro-bajak estaltzeko promozio berriak deitzeko –gero eta gehiago dira erretiroak eragindako bajak–; logikoa denez, baja horiek eragina zeukaten eta daukate Miñoien Atalean ere, bajak oraindik ere gertatzen ari baitira. 26. promozioa hautatzeko prozesua urtebete baino gehiago atzeratu da, errekurtsoek mehatxatu egin zituztelako 25. promozioaren bilakaera normala eta amaiera. Nik uste dut orain bide onetik goazela, baina ez duzu ahaztu behar promozio baten deialdia egiten dugunetik agenteak kalera irteten diren arte bi urte baino gehiago igarotzen direla. Bi urte baino gehiago. Eta ziztu bizian ibili ginen, betoa amaitu orduko azterketak egiteko: hogei egun eskas behar izan genituen 26. promozioko azterketen deialdia egiteko. Ziztu bizian ibili ginen, baina, hala ere, kale. Denbora eman behar da, ezta? Hemen, urdaiazpikoekin bezala geratzen da: denbora eman behar zaie, ontzeko. Bada, (Date: 01.12.2017)'],
79
+ ['Zergatik dimititu zuen Eusko Jaurlaritzako Komunikazio zuzendariak?', '[TOPIC: Galdera, Gorka Maneiro Labayen Mistoa-UPyD taldeko legebiltzarkideak lehendakariari egina, Eusko Jaurlaritzako Komunikazio zuzendariaren dimisioaren ondoren hartu beharreko erantzukizun politikoei buruz]\n[MANEIRO LABAYEN, (Mixto-UPyD)]:\nsailburu jakin batzuei elkarrizketak egitearen truke? Erantzun ahal diezaiokezu galdera horri? Halaxe da, bai. Zure esanetan, ez dago ezer arrarorik eta irregularrik, baina pertsona batek dimititu egin du. Zer egiteko asmoa duzu zuk? Bide batez, zer da aldi baterako dimisioaren kontu hori? Beste postu batean jarri al duzue pertsona hori? Diru publikoa kobratzen jarraitzen al du? Argitu dezakezu, edo herritarrak engainatu nahi dituzue? Pertsona horrek dimititu egin du. Zer egiteko (Date: 30.10.2015)'],
80
+ ['Zein da euskal herritarren iritzia independentziari buruz, Soziometroaren arabera?', '[TOPIC: Mozioa, Maddalen Iriarte Okiñena EH Bildu taldeko legebiltzarkideak aurkeztua, herri bezala ditugun erronka estrategikoei erantzuteko, herri-jakintza aktibatzeko eta ariketa kolektibo bat egiteko beharraren inguruan. Eztabaida eta behin betiko ebazpena]\n[BARRIO BAROJA, (PV-ETP)]:\nasko; eta ezin dela horren autokonplazientea izan eta dena positiboki egin dela esan. Argi dago, Iriarte andrea, amaitzeko, etorkizuneko erronkak ditugula; ados gaude gogor lan egin behar dela; baina estatus berria herritarrei arazo gehiago sortzea da; hura agerian jartzea eta hona ekartzea, berriz ere konfrontazio- eta eztabaida-eremu izatea da, herritarrei arazo gehiago sortzea da. Atzo argi eta garbi zioen euskal Soziometroak euskal herritarrok independentziari buruz zer iritzi dugu; eta inoiz ez da hain maila baxurik ikusi. Beraz, ildo horretan, erronka estrategikoei buruz hitz egiten ari zaren une honetan, estatus berriaren eztabaida hona ekartzea atzerapausoa litzateke, arazo gehiago ematea litzateke; eta, jakina, gu –zuri erantzuten dizut, baita orain hura aldarrikatu duen Egibar jaunari ere esaten diot– aurka egongo gara. Eskerrik asko. (Date: 10.06.2021)'],
81
+ ['Zeintzuk dira Eusko Jaurlaritzaren asmoak euskararen normalizazioan sakontzeko?', '[TOPIC: Galdera, Rebeka Ubera Aranzeta EH Bildu taldeko legebiltzarkideak Kultura eta Hizkuntza Politikako sailburuari egina, euskararen normalizazioan sakontzeko neurri funtsezkoak hartzeari buruz]\n[UBERA ARANZETA, (EH Bildu)]:\nAdministrazioa euskalduntzeko urratsak emango zirela: ekarpenak egin ditugu eta ezezkoa jaso dugu. Esan zitzaigun euskara ikastea doako bilakatzeko urratsak emango zirela, eta mugak besterik ez dugu ikusi eta ezezkoa jaso dugu. Eta jada dagoeneko zalantzan jartzen hasiak gara Gobernu honen borondate politikoa zein den. Eta, legegintzaldi honetan, sailburuen aldetik ere, atzerakada izugarria izan da, aurreko legegintzaldiarekin konparatuta –nabarmen gainera–, eta zentzu horretan ere, zerbait egin beharko duzu. Neurtzen ari (Date: 19.05.2017)'],
82
+ ]
83
+ scores = model.predict(pairs)
84
+ print(scores.shape)
85
+ # (5,)
86
+
87
+ # Or rank different texts based on similarity to a single text
88
+ ranks = model.rank(
89
+ 'Zer gertatu zen martxoaren 3an Euskal Autonomia Erkidegoan?',
90
+ [
91
+ '[TOPIC: Honako ekimen hauek batera eztabaidatu eta behin betiko ebazpena hartzea: ]\n[UNZALU HERMOSA, (SV-ES)]:\nSekula. Gertatzen dena da uste dugula martxoaren 3ko jokaerak baduela zer hobetua. Eta hobetzeko abiapuntu bakarra gogoeta egitea da, aztertzea eta hasieratik aitortzea hutsegiteak egin zirela. Izan ere, nire lehenengo hitzaldian esan dudanez, triskantzak gertatu izanak pentsarazi behar liguke zerbaitek huts egin zuela egun hartako dispositiboa edo operazioa planifikatzean eta zuzentzean. Horixe sartu nahi dugu guk: eztabaida-elementuak, hobekuntzarako kritika-elementuak, eta UPyDrekin eta Alderdi Popularrarekin sinatu dugun zuzenketan hori esaten da, onar dadila gauzak hobetu egin daitezkeela. Izan ere, Iturrate jauna, zuk egin dizkiguzun galderei nik beste batzuekin erantzungo nieke. Posible da hutsegiteetatik ikastea eta herritarren segurtasuna hobetzea? Posible da? Edo, besterik gabe, "Ahal zen modu bakarrean jokatu dugu" esatera mugatu behar dugu? Posible da herritarrei kalte gutxiago eragitea horrelako istiluak gertatzen direnean? Horixe planteatu nahi dugu guk, beharrezkoa dela… Eta uste osoa dugunez hobetu daitekeela, eta uste osoa dugunez hobeto joka zitekeela, horregatik nahi dugu eta horregatik planteatzen dugu hutsegiteak aztertzea, gogoeta egitea, eta elementu zuzentzaileak martxan jartzea horrelako egoerarik berriro gerta ez dadin. Eta, begira, sailarekin batera dispositiboari babesa eman dioten bakarrak dira, hain justu, Ertzaintzaren jokaerak inoiz babesten ez dituztenak; lehen esan dudanez, Ertzaintzaren kontrako ekintzak ere gaitzetsi ez dituztenak. Eta horrek kezkatu egiten gaitu. Nik ez dakit zu, Iturrate jauna, eta sailburu andrea kezkatzen zaituzten; baina, (Date: 03.04.2014)',
92
+ '[TOPIC: Interpelazioa, Javier Ruiz de Arbulo Cerio Euskal Talde Popularreko legebiltzarkideak Segurtasuneko sailburuari egina, Arabako Miñoien Atalari buruz]\n[SEGURTASUNEKO SAILBURUAK (BELTRÁN DE HEREDIA ARRONIZ), (EA-NV)]:\nhoriek aldatu egiten dira egun batetik bestera, unitate batetik bestera, kontuan hartuta zer bilakaera duten erretiroek, kontuan hartuta nola gertatzen diren baja horiek… Baina, batez ere, nik bezain ondo dakizu Ertzaintzan defizit handia daukagula, eta ezin hobeto dakizu zergatia zein den. Ez dakit defizit horren zergatia zein den errepika diezazudan etorri zaren hona, baina ez daukat inolako eragozpenik Legebiltzar honetan berriro azaltzeko eta zuek berriro entzun behar izateko. Honela gaude Espainiako Gobernuak, Alderdi Popularraren Gobernuak, denbora asko behar izan zuelako, denbora gehiegi, zuk behar izan duzun be- zala, ulertzeko premia geneukala Ertzaintzan gertatzen ari ziren erretiro-bajak estaltzeko promozio berriak deitzeko –gero eta gehiago dira erretiroak eragindako bajak–; logikoa denez, baja horiek eragina zeukaten eta daukate Miñoien Atalean ere, bajak oraindik ere gertatzen ari baitira. 26. promozioa hautatzeko prozesua urtebete baino gehiago atzeratu da, errekurtsoek mehatxatu egin zituztelako 25. promozioaren bilakaera normala eta amaiera. Nik uste dut orain bide onetik goazela, baina ez duzu ahaztu behar promozio baten deialdia egiten dugunetik agenteak kalera irteten diren arte bi urte baino gehiago igarotzen direla. Bi urte baino gehiago. Eta ziztu bizian ibili ginen, betoa amaitu orduko azterketak egiteko: hogei egun eskas behar izan genituen 26. promozioko azterketen deialdia egiteko. Ziztu bizian ibili ginen, baina, hala ere, kale. Denbora eman behar da, ezta? Hemen, urdaiazpikoekin bezala geratzen da: denbora eman behar zaie, ontzeko. Bada, (Date: 01.12.2017)',
93
+ '[TOPIC: Galdera, Gorka Maneiro Labayen Mistoa-UPyD taldeko legebiltzarkideak lehendakariari egina, Eusko Jaurlaritzako Komunikazio zuzendariaren dimisioaren ondoren hartu beharreko erantzukizun politikoei buruz]\n[MANEIRO LABAYEN, (Mixto-UPyD)]:\nsailburu jakin batzuei elkarrizketak egitearen truke? Erantzun ahal diezaiokezu galdera horri? Halaxe da, bai. Zure esanetan, ez dago ezer arrarorik eta irregularrik, baina pertsona batek dimititu egin du. Zer egiteko asmoa duzu zuk? Bide batez, zer da aldi baterako dimisioaren kontu hori? Beste postu batean jarri al duzue pertsona hori? Diru publikoa kobratzen jarraitzen al du? Argitu dezakezu, edo herritarrak engainatu nahi dituzue? Pertsona horrek dimititu egin du. Zer egiteko (Date: 30.10.2015)',
94
+ '[TOPIC: Mozioa, Maddalen Iriarte Okiñena EH Bildu taldeko legebiltzarkideak aurkeztua, herri bezala ditugun erronka estrategikoei erantzuteko, herri-jakintza aktibatzeko eta ariketa kolektibo bat egiteko beharraren inguruan. Eztabaida eta behin betiko ebazpena]\n[BARRIO BAROJA, (PV-ETP)]:\nasko; eta ezin dela horren autokonplazientea izan eta dena positiboki egin dela esan. Argi dago, Iriarte andrea, amaitzeko, etorkizuneko erronkak ditugula; ados gaude gogor lan egin behar dela; baina estatus berria herritarrei arazo gehiago sortzea da; hura agerian jartzea eta hona ekartzea, berriz ere konfrontazio- eta eztabaida-eremu izatea da, herritarrei arazo gehiago sortzea da. Atzo argi eta garbi zioen euskal Soziometroak euskal herritarrok independentziari buruz zer iritzi dugu; eta inoiz ez da hain maila baxurik ikusi. Beraz, ildo horretan, erronka estrategikoei buruz hitz egiten ari zaren une honetan, estatus berriaren eztabaida hona ekartzea atzerapausoa litzateke, arazo gehiago ematea litzateke; eta, jakina, gu –zuri erantzuten dizut, baita orain hura aldarrikatu duen Egibar jaunari ere esaten diot– aurka egongo gara. Eskerrik asko. (Date: 10.06.2021)',
95
+ '[TOPIC: Galdera, Rebeka Ubera Aranzeta EH Bildu taldeko legebiltzarkideak Kultura eta Hizkuntza Politikako sailburuari egina, euskararen normalizazioan sakontzeko neurri funtsezkoak hartzeari buruz]\n[UBERA ARANZETA, (EH Bildu)]:\nAdministrazioa euskalduntzeko urratsak emango zirela: ekarpenak egin ditugu eta ezezkoa jaso dugu. Esan zitzaigun euskara ikastea doako bilakatzeko urratsak emango zirela, eta mugak besterik ez dugu ikusi eta ezezkoa jaso dugu. Eta jada dagoeneko zalantzan jartzen hasiak gara Gobernu honen borondate politikoa zein den. Eta, legegintzaldi honetan, sailburuen aldetik ere, atzerakada izugarria izan da, aurreko legegintzaldiarekin konparatuta –nabarmen gainera–, eta zentzu horretan ere, zerbait egin beharko duzu. Neurtzen ari (Date: 19.05.2017)',
96
+ ]
97
+ )
98
+ # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
99
+ ```
100
+
101
+ <!--
102
+ ### Direct Usage (Transformers)
103
+
104
+ <details><summary>Click to see the direct usage in Transformers</summary>
105
+
106
+ </details>
107
+ -->
108
+
109
+ <!--
110
+ ### Downstream Usage (Sentence Transformers)
111
+
112
+ You can finetune this model on your own dataset.
113
+
114
+ <details><summary>Click to expand</summary>
115
+
116
+ </details>
117
+ -->
118
+
119
+ <!--
120
+ ### Out-of-Scope Use
121
+
122
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
123
+ -->
124
+
125
+ ## Evaluation
126
+
127
+ ### Metrics
128
+
129
+ #### Cross Encoder Reranking
130
+
131
+ * Dataset: `jina-reranker-v2-base-multilingual-contrastive-parl-4-10ep`
132
+ * Evaluated with [<code>CrossEncoderRerankingEvaluator</code>](https://sbert.net/docs/package_reference/cross_encoder/evaluation.html#sentence_transformers.cross_encoder.evaluation.CrossEncoderRerankingEvaluator) with these parameters:
133
+ ```json
134
+ {
135
+ "at_k": 10,
136
+ "always_rerank_positives": false
137
+ }
138
+ ```
139
+
140
+ | Metric | Value |
141
+ |:------------|:---------------------|
142
+ | map | 0.0194 (+0.0172) |
143
+ | mrr@10 | 0.0194 (+0.0176) |
144
+ | **ndcg@10** | **0.0198 (+0.0172)** |
145
+
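+ For illustration, these metrics can be recomputed on your own data roughly as follows; `eval_samples` and its contents below are placeholders, not the actual evaluation split used for this card:
+
+ ```python
+ from sentence_transformers import CrossEncoder
+ from sentence_transformers.cross_encoder.evaluation import CrossEncoderRerankingEvaluator
+
+ model = CrossEncoder("cuadron11/jina-reranker-v2-base-multilingual-contrastive-parl-4-10ep")
+
+ # Each sample pairs a query with its relevant passage(s) and a candidate list to rerank.
+ eval_samples = [
+     {
+         "query": "Zer gertatu zen martxoaren 3an Euskal Autonomia Erkidegoan?",
+         "positive": ["... relevant parliamentary passage ..."],
+         "documents": ["... candidate passage 1 ...", "... candidate passage 2 ..."],
+     },
+ ]
+
+ evaluator = CrossEncoderRerankingEvaluator(
+     samples=eval_samples,
+     at_k=10,
+     always_rerank_positives=False,
+     name="jina-reranker-v2-base-multilingual-contrastive-parl-4-10ep",
+ )
+ results = evaluator(model)
+ # results is a dict with map, mrr@10 and ndcg@10 entries keyed by the evaluator name
+ ```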
146
+ <!--
147
+ ## Bias, Risks and Limitations
148
+
149
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
150
+ -->
151
+
152
+ <!--
153
+ ### Recommendations
154
+
155
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
156
+ -->
157
+
158
+ ## Training Details
159
+
160
+ ### Training Dataset
161
+
162
+ #### Unnamed Dataset
163
+
164
+ * Size: 3,200 training samples
165
+ * Columns: <code>query</code> and <code>positive</code>
166
+ * Approximate statistics based on the first 1000 samples:
167
+ | | query | positive |
168
+ |:--------|:-----------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------|
169
+ | type | string | string |
170
+ | details | <ul><li>min: 27 characters</li><li>mean: 99.5 characters</li><li>max: 250 characters</li></ul> | <ul><li>min: 569 characters</li><li>mean: 975.13 characters</li><li>max: 2175 characters</li></ul> |
171
+ * Samples:
172
+ | query | positive |
173
+ |:-------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
174
+ | <code>Zein urtetan egin zuen José Ramón Becerra Carollo legebiltzarkideak SOS Deiak-112 larrialdi-deien arretarako zerbitzuaren esleipenari buruzko mozioa?</code> | <code>[TOPIC: Mozioa, José Ramón Becerra Carollo Elkarrekin Podemos taldeko legebiltzarkideak aurkeztua, SOS Deiak-112 larrialdi-deien arretarako zerbitzuaren esleipenari buruz. Eztabaida eta behin betiko ebazpena]<br>[LATXAGA UGARTEMENDIA, (EA-NV)]:<br>eta gero Sabin Etxearekin, Eliza Katolikoarekin, Xabier Arzalluzekin eta Eusko Jaurlaritzarekin berarekin lotu zenuen enpresa esleipenduna. Konspirazio perfektua lortzeko, Mosad eta BBVA falta zitzaizkizun, nik uste. Mesedez, ez erabili Ganbera hau gure eserlekuen gainean zikinkeria, zaborra botatzeko. Ez erabili horretarako, onbidezko gauzetarako baizik. Eta ez egin funtsik gabe, inolako frogarik gabe. Zuk esaten zenuena oso larria zen, oso larria, eta ezin duzu hemen tribuna honetan besterik gabe (Date: 21.12.2017)</code> |
175
+ | <code>Zergatik da beharrezkoa kargudun publikoen jokaera kodea arautzea?</code> | <code>[TOPIC: Euskal Sozialistak legebiltzar-taldeak egindako lege-proposamena, Kargudun Publikoaren Jokaera Kodea eta haren Bateraezintasunen Erregimena arautzeko. Aintzat hartzeari buruzko eztabaida eta behin betiko ebazpena]<br>[MINTEGI LAKARRA, (EH Bildu)]:<br>Egun on, presidente andrea, lehendakari jauna, legebiltzarkideok. Legerik onena da behar ez dena eta arautu beharra dagoenean hor badago ja gabeziaren sintoma, edo ez dagoelako adostasunik edo jokaera desegokiak egon direlako eta horiek saihestu behar direlako eta ez da ikusi beste biderik arautu beharra baino. Beraz, orain kargu publikoen jokaera etikoa edo jokaera kodea arautu beharrak adierazten digu badagoela gabezia, horren sintoma da. Izatez, jokaera zuzena berezkoa izan beharko (Date: 28.02.2013)</code> |
176
+ | <code>Zein da EH Bildu talde parlamentarioaren jarrera Ikuskizunen eta Jolas Jardueren Legea garatzeko erregelamenduaren inguruan?</code> | <code>[TOPIC: EH Bildu talde parlamentarioak egindako legez besteko proposamena, Ikuskizunen eta Jolas Jardueren Legea garatzeko erregelamenduaren inguruan. Eztabaida eta behin betiko ebazpena]<br>[ÁLVAREZ MARTÍNEZ, (EA-NV)]:<br>mintzaldian aipatu ditugun puntuak zehaztu behar ditugun. Uste dugu, erantzukizunetik, dekretu hori berrikusi egin behar dela, eta uste dugu dagoeneko abian dela berrikuspen-prozesu hori, Eudelekin batera, udalek dituzten ikuspegiekin batera. Puntu honetan, gogoratu behar da Eudelen kolore guzti-guztietako udalak daudela ordezkatuta, eta kontuan hartu behar da, halaber, udal horiek guztiek zer iritzi duten eta zer ikuspuntu duten. Sémper jauna, nik ere uste dut –esperientzia handirik ez daukat, baina (Date: 14.03.2019)</code> |
177
+ * Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
178
+ ```json
179
+ {
180
+ "scale": 10.0,
181
+ "num_negatives": null,
182
+ "activation_fn": "torch.nn.modules.activation.Sigmoid",
183
+ "mini_batch_size": 16
184
+ }
185
+ ```
186
+
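+ For illustration, the loss above maps onto the sentence-transformers cross-encoder training API roughly as follows (the base model id is the one this card reports fine-tuning from):
+
+ ```python
+ import torch
+ from sentence_transformers import CrossEncoder
+ from sentence_transformers.cross_encoder.losses import CachedMultipleNegativesRankingLoss
+
+ # trust_remote_code is needed for the custom Jina modeling code
+ model = CrossEncoder("jinaai/jina-reranker-v2-base-multilingual", trust_remote_code=True)
+
+ # num_negatives=None uses every other in-batch passage as a negative; the "cached"
+ # variant processes the batch in mini-batches of 16 to keep memory usage bounded.
+ loss = CachedMultipleNegativesRankingLoss(
+     model=model,
+     num_negatives=None,
+     scale=10.0,
+     activation_fn=torch.nn.Sigmoid(),
+     mini_batch_size=16,
+ )
+ ```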
187
+ ### Evaluation Dataset
188
+
189
+ #### Unnamed Dataset
190
+
191
+ * Size: 800 evaluation samples
192
+ * Columns: <code>query</code> and <code>positive</code>
193
+ * Approximate statistics based on the first 800 samples:
194
+ | | query | positive |
195
+ |:--------|:-------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------|
196
+ | type | string | string |
197
+ | details | <ul><li>min: 32 characters</li><li>mean: 102.26 characters</li><li>max: 247 characters</li></ul> | <ul><li>min: 550 characters</li><li>mean: 1011.95 characters</li><li>max: 2370 characters</li></ul> |
198
+ * Samples:
199
+ | query | positive |
200
+ |:-----------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
201
+ | <code>Zer gertatu zen martxoaren 3an Euskal Autonomia Erkidegoan?</code> | <code>[TOPIC: Honako ekimen hauek batera eztabaidatu eta behin betiko ebazpena hartzea: ]<br>[UNZALU HERMOSA, (SV-ES)]:<br>Sekula. Gertatzen dena da uste dugula martxoaren 3ko jokaerak baduela zer hobetua. Eta hobetzeko abiapuntu bakarra gogoeta egitea da, aztertzea eta hasieratik aitortzea hutsegiteak egin zirela. Izan ere, nire lehenengo hitzaldian esan dudanez, triskantzak gertatu izanak pentsarazi behar liguke zerbaitek huts egin zuela egun hartako dispositiboa edo operazioa planifikatzean eta zuzentzean. Horixe sartu nahi dugu guk: eztabaida-elementuak, hobekuntzarako kritika-elementuak, eta UPyDrekin eta Alderdi Popularrarekin sinatu dugun zuzenketan hori esaten da, onar dadila gauzak hobetu egin daitezkeela. Izan ere, Iturrate jauna, zuk egin dizkiguzun galderei nik beste batzuekin erantzungo nieke. Posible da hutsegiteetatik ikastea eta herritarren segurtasuna hobetzea? Posible da? Edo, besterik gabe, "Ahal zen modu bakarrean jokatu dugu" esatera mugatu behar dugu? Posible da herritarrei k...</code> |
202
+ | <code>Zenbat denbora behar da Ertzaintzako promozio baten deialdia egiten denetik agenteak kalera irteten diren arte?</code> | <code>[TOPIC: Interpelazioa, Javier Ruiz de Arbulo Cerio Euskal Talde Popularreko legebiltzarkideak Segurtasuneko sailburuari egina, Arabako Miñoien Atalari buruz]<br>[SEGURTASUNEKO SAILBURUAK (BELTRÁN DE HEREDIA ARRONIZ), (EA-NV)]:<br>horiek aldatu egiten dira egun batetik bestera, unitate batetik bestera, kontuan hartuta zer bilakaera duten erretiroek, kontuan hartuta nola gertatzen diren baja horiek… Baina, batez ere, nik bezain ondo dakizu Ertzaintzan defizit handia daukagula, eta ezin hobeto dakizu zergatia zein den. Ez dakit defizit horren zergatia zein den errepika diezazudan etorri zaren hona, baina ez daukat inolako eragozpenik Legebiltzar honetan berriro azaltzeko eta zuek berriro entzun behar izateko. Honela gaude Espainiako Gobernuak, Alderdi Popularraren Gobernuak, denbora asko behar izan zuelako, denbora gehiegi, zuk behar izan duzun be- zala, ulertzeko premia geneukala Ertzaintzan gertatzen ari ziren erretiro-bajak estaltzeko promozio berriak deitzeko –gero eta gehiago dira erretiro...</code> |
203
+ | <code>Zergatik dimititu zuen Eusko Jaurlaritzako Komunikazio zuzendariak?</code> | <code>[TOPIC: Galdera, Gorka Maneiro Labayen Mistoa-UPyD taldeko legebiltzarkideak lehendakariari egina, Eusko Jaurlaritzako Komunikazio zuzendariaren dimisioaren ondoren hartu beharreko erantzukizun politikoei buruz]<br>[MANEIRO LABAYEN, (Mixto-UPyD)]:<br>sailburu jakin batzuei elkarrizketak egitearen truke? Erantzun ahal diezaiokezu galdera horri? Halaxe da, bai. Zure esanetan, ez dago ezer arrarorik eta irregularrik, baina pertsona batek dimititu egin du. Zer egiteko asmoa duzu zuk? Bide batez, zer da aldi baterako dimisioaren kontu hori? Beste postu batean jarri al duzue pertsona hori? Diru publikoa kobratzen jarraitzen al du? Argitu dezakezu, edo herritarrak engainatu nahi dituzue? Pertsona horrek dimititu egin du. Zer egiteko (Date: 30.10.2015)</code> |
204
+ * Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
205
+ ```json
206
+ {
207
+ "scale": 10.0,
208
+ "num_negatives": null,
209
+ "activation_fn": "torch.nn.modules.activation.Sigmoid",
210
+ "mini_batch_size": 16
211
+ }
212
+ ```
213
+
214
+ ### Training Hyperparameters
215
+ #### Non-Default Hyperparameters
216
+
217
+ - `eval_strategy`: steps
218
+ - `per_device_train_batch_size`: 16
219
+ - `per_device_eval_batch_size`: 16
220
+ - `learning_rate`: 2e-05
221
+ - `num_train_epochs`: 10
222
+ - `warmup_ratio`: 0.1
223
+ - `load_best_model_at_end`: True
224
+ - `batch_sampler`: no_duplicates
225
+
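+ For illustration, these non-default values map onto the sentence-transformers trainer roughly as follows; `model`, `loss`, `train_dataset` and `eval_dataset` stand in for the objects described above, and the output directory is only illustrative:
+
+ ```python
+ from sentence_transformers.cross_encoder import CrossEncoderTrainer, CrossEncoderTrainingArguments
+ from sentence_transformers.training_args import BatchSamplers
+
+ args = CrossEncoderTrainingArguments(
+     output_dir="models/jina-reranker-v2-base-multilingual-contrastive-parl-4-10ep",
+     eval_strategy="steps",
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+     learning_rate=2e-5,
+     num_train_epochs=10,
+     warmup_ratio=0.1,
+     load_best_model_at_end=True,
+     batch_sampler=BatchSamplers.NO_DUPLICATES,
+ )
+
+ trainer = CrossEncoderTrainer(
+     model=model,                  # CrossEncoder to fine-tune
+     args=args,
+     train_dataset=train_dataset,  # 3,200 (query, positive) pairs
+     eval_dataset=eval_dataset,    # 800 (query, positive) pairs
+     loss=loss,                    # CachedMultipleNegativesRankingLoss from above
+ )
+ trainer.train()
+ ```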
226
+ #### All Hyperparameters
227
+ <details><summary>Click to expand</summary>
228
+
229
+ - `overwrite_output_dir`: False
230
+ - `do_predict`: False
231
+ - `eval_strategy`: steps
232
+ - `prediction_loss_only`: True
233
+ - `per_device_train_batch_size`: 16
234
+ - `per_device_eval_batch_size`: 16
235
+ - `per_gpu_train_batch_size`: None
236
+ - `per_gpu_eval_batch_size`: None
237
+ - `gradient_accumulation_steps`: 1
238
+ - `eval_accumulation_steps`: None
239
+ - `torch_empty_cache_steps`: None
240
+ - `learning_rate`: 2e-05
241
+ - `weight_decay`: 0.0
242
+ - `adam_beta1`: 0.9
243
+ - `adam_beta2`: 0.999
244
+ - `adam_epsilon`: 1e-08
245
+ - `max_grad_norm`: 1.0
246
+ - `num_train_epochs`: 10
247
+ - `max_steps`: -1
248
+ - `lr_scheduler_type`: linear
249
+ - `lr_scheduler_kwargs`: {}
250
+ - `warmup_ratio`: 0.1
251
+ - `warmup_steps`: 0
252
+ - `log_level`: passive
253
+ - `log_level_replica`: warning
254
+ - `log_on_each_node`: True
255
+ - `logging_nan_inf_filter`: True
256
+ - `save_safetensors`: True
257
+ - `save_on_each_node`: False
258
+ - `save_only_model`: False
259
+ - `restore_callback_states_from_checkpoint`: False
260
+ - `no_cuda`: False
261
+ - `use_cpu`: False
262
+ - `use_mps_device`: False
263
+ - `seed`: 42
264
+ - `data_seed`: None
265
+ - `jit_mode_eval`: False
266
+ - `use_ipex`: False
267
+ - `bf16`: False
268
+ - `fp16`: False
269
+ - `fp16_opt_level`: O1
270
+ - `half_precision_backend`: auto
271
+ - `bf16_full_eval`: False
272
+ - `fp16_full_eval`: False
273
+ - `tf32`: None
274
+ - `local_rank`: 0
275
+ - `ddp_backend`: None
276
+ - `tpu_num_cores`: None
277
+ - `tpu_metrics_debug`: False
278
+ - `debug`: []
279
+ - `dataloader_drop_last`: False
280
+ - `dataloader_num_workers`: 0
281
+ - `dataloader_prefetch_factor`: None
282
+ - `past_index`: -1
283
+ - `disable_tqdm`: False
284
+ - `remove_unused_columns`: True
285
+ - `label_names`: None
286
+ - `load_best_model_at_end`: True
287
+ - `ignore_data_skip`: False
288
+ - `fsdp`: []
289
+ - `fsdp_min_num_params`: 0
290
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
291
+ - `fsdp_transformer_layer_cls_to_wrap`: None
292
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
293
+ - `parallelism_config`: None
294
+ - `deepspeed`: None
295
+ - `label_smoothing_factor`: 0.0
296
+ - `optim`: adamw_torch
297
+ - `optim_args`: None
298
+ - `adafactor`: False
299
+ - `group_by_length`: False
300
+ - `length_column_name`: length
301
+ - `ddp_find_unused_parameters`: None
302
+ - `ddp_bucket_cap_mb`: None
303
+ - `ddp_broadcast_buffers`: False
304
+ - `dataloader_pin_memory`: True
305
+ - `dataloader_persistent_workers`: False
306
+ - `skip_memory_metrics`: True
307
+ - `use_legacy_prediction_loop`: False
308
+ - `push_to_hub`: False
309
+ - `resume_from_checkpoint`: None
310
+ - `hub_model_id`: None
311
+ - `hub_strategy`: every_save
312
+ - `hub_private_repo`: None
313
+ - `hub_always_push`: False
314
+ - `hub_revision`: None
315
+ - `gradient_checkpointing`: False
316
+ - `gradient_checkpointing_kwargs`: None
317
+ - `include_inputs_for_metrics`: False
318
+ - `include_for_metrics`: []
319
+ - `eval_do_concat_batches`: True
320
+ - `fp16_backend`: auto
321
+ - `push_to_hub_model_id`: None
322
+ - `push_to_hub_organization`: None
323
+ - `mp_parameters`:
324
+ - `auto_find_batch_size`: False
325
+ - `full_determinism`: False
326
+ - `torchdynamo`: None
327
+ - `ray_scope`: last
328
+ - `ddp_timeout`: 1800
329
+ - `torch_compile`: False
330
+ - `torch_compile_backend`: None
331
+ - `torch_compile_mode`: None
332
+ - `include_tokens_per_second`: False
333
+ - `include_num_input_tokens_seen`: False
334
+ - `neftune_noise_alpha`: None
335
+ - `optim_target_modules`: None
336
+ - `batch_eval_metrics`: False
337
+ - `eval_on_start`: False
338
+ - `use_liger_kernel`: False
339
+ - `liger_kernel_config`: None
340
+ - `eval_use_gather_object`: False
341
+ - `average_tokens_across_devices`: False
342
+ - `prompts`: None
343
+ - `batch_sampler`: no_duplicates
344
+ - `multi_dataset_batch_sampler`: proportional
345
+ - `router_mapping`: {}
346
+ - `learning_rate_mapping`: {}
347
+
348
+ </details>
349
+
350
+ ### Training Logs
351
+ | Epoch | Step | Training Loss | Validation Loss | jina-reranker-v2-base-multilingual-contrastive-parl-4-10ep_ndcg@10 |
352
+ |:-------:|:-------:|:-------------:|:---------------:|:------------------------------------------------------------------:|
353
+ | **1.0** | **200** | **0.0644** | **0.0238** | **0.0200 (+0.0175)** |
354
+ | 2.0 | 400 | 0.0238 | 0.0220 | 0.0198 (+0.0172) |
355
+ | 3.0 | 600 | 0.0182 | 0.0231 | 0.0200 (+0.0175) |
356
+ | 4.0 | 800 | 0.0167 | 0.0235 | 0.0198 (+0.0172) |
357
+ | 5.0 | 1000 | 0.0123 | 0.0240 | 0.0198 (+0.0172) |
358
+ | 6.0 | 1200 | 0.0123 | 0.0260 | 0.0198 (+0.0172) |
359
+ | 7.0 | 1400 | 0.0133 | 0.0260 | 0.0198 (+0.0172) |
360
+ | 8.0 | 1600 | 0.0143 | 0.0258 | 0.0198 (+0.0172) |
361
+ | 9.0 | 1800 | 0.0136 | 0.0258 | 0.0198 (+0.0172) |
362
+ | 10.0 | 2000 | 0.0135 | 0.0257 | 0.0198 (+0.0172) |
363
+
364
+ * The bold row denotes the saved checkpoint.
365
+
366
+ ### Framework Versions
367
+ - Python: 3.9.7
368
+ - Sentence Transformers: 5.0.0
369
+ - Transformers: 4.56.0
370
+ - PyTorch: 2.7.1+cu126
371
+ - Accelerate: 1.5.2
372
+ - Datasets: 4.0.0
373
+ - Tokenizers: 0.22.0
374
+
375
+ ## Citation
376
+
377
+ ### BibTeX
378
+
379
+ #### Sentence Transformers
380
+ ```bibtex
381
+ @inproceedings{reimers-2019-sentence-bert,
382
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
383
+ author = "Reimers, Nils and Gurevych, Iryna",
384
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
385
+ month = "11",
386
+ year = "2019",
387
+ publisher = "Association for Computational Linguistics",
388
+ url = "https://arxiv.org/abs/1908.10084",
389
+ }
390
+ ```
391
+
392
+ <!--
393
+ ## Glossary
394
+
395
+ *Clearly define terms in order to be accessible across audiences.*
396
+ -->
397
+
398
+ <!--
399
+ ## Model Card Authors
400
+
401
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
402
+ -->
403
+
404
+ <!--
405
+ ## Model Card Contact
406
+
407
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
408
+ -->
block.py ADDED
@@ -0,0 +1,470 @@
1
+ # This implementation was adapted from https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/block.py
2
+ # Commit id: abbc1311731867310635f9edc2a9ec18317c8c48
3
+
4
+ # Copyright (c) 2024, Tri Dao.
5
+
6
+ from functools import partial
7
+ from typing import Optional
8
+
9
+ import torch
10
+ import torch.fx
11
+ import torch.nn as nn
12
+ import torch.nn.functional as F
13
+ from torch import Tensor
14
+
15
+ from .mha import MHA
16
+ from .mlp import Mlp
17
+
18
+ try:
19
+ from flash_attn.ops.triton.layer_norm import layer_norm_fn, RMSNorm
20
+ except ImportError:
21
+ layer_norm_fn, RMSNorm = None, None
22
+
23
+
24
+ def stochastic_depth(
25
+ input: Tensor, p: float, mode: str, training: bool = True
26
+ ) -> Tensor:
27
+ """
28
+ Implements the Stochastic Depth from `"Deep Networks with Stochastic Depth"
29
+ <https://arxiv.org/abs/1603.09382>`_ used for randomly dropping residual
30
+ branches of residual architectures.
31
+ Args:
32
+ input (Tensor[N, ...]): The input tensor or arbitrary dimensions with the first one
33
+ being its batch i.e. a batch with ``N`` rows.
34
+ p (float): probability of the input to be zeroed.
35
+ mode (str): ``"batch"`` or ``"row"``.
36
+ ``"batch"`` randomly zeroes the entire input, ``"row"`` zeroes
37
+ randomly selected rows from the batch.
38
+ training: apply stochastic depth if is ``True``. Default: ``True``
39
+ Returns:
40
+ Tensor[N, ...]: The randomly zeroed tensor.
41
+ """
42
+ if p < 0.0 or p > 1.0:
43
+ raise ValueError(f"drop probability has to be between 0 and 1, but got {p}")
44
+ if mode not in ["batch", "row"]:
45
+ raise ValueError(f"mode has to be either 'batch' or 'row', but got {mode}")
46
+ if not training or p == 0.0:
47
+ return input
48
+
49
+ survival_rate = 1.0 - p
50
+ if mode == "row":
51
+ size = [input.shape[0]] + [1] * (input.ndim - 1)
52
+ else:
53
+ size = [1] * input.ndim
54
+ noise = torch.empty(size, dtype=input.dtype, device=input.device)
55
+ noise = noise.bernoulli_(survival_rate)
56
+ if survival_rate > 0.0:
57
+ noise.div_(survival_rate)
58
+ return input * noise
59
+
60
+
61
+ torch.fx.wrap("stochastic_depth")
62
+
63
+
64
+ class StochasticDepth(nn.Module):
65
+ """
66
+ See :func:`stochastic_depth`.
67
+ """
68
+
69
+ def __init__(self, p: float, mode: str) -> None:
70
+ super().__init__()
71
+ self.p = p
72
+ self.mode = mode
73
+
74
+ def forward(self, input: Tensor) -> Tensor:
75
+ return stochastic_depth(input, self.p, self.mode, self.training)
76
+
77
+ def __repr__(self) -> str:
78
+ s = f"{self.__class__.__name__}(p={self.p}, mode={self.mode})"
79
+ return s
80
+
81
+
82
+ class Block(nn.Module):
83
+ def __init__(
84
+ self,
85
+ dim,
86
+ mixer_cls=None,
87
+ mlp_cls=None,
88
+ norm_cls=nn.LayerNorm,
89
+ dropout_cls=nn.Dropout,
90
+ prenorm=True,
91
+ resid_dropout1=0.0,
92
+ resid_dropout2=0.0,
93
+ drop_path1=0.0,
94
+ drop_path2=0.0,
95
+ fused_dropout_add_ln=False,
96
+ return_residual=False,
97
+ residual_in_fp32=False,
98
+ sequence_parallel=False,
99
+ mark_shared_params=False,
100
+ ):
101
+ """
102
+ For prenorm=True, this Block has a slightly different structure compared to a regular
103
+ prenorm Transformer block.
104
+ The standard block is: LN -> MHA -> Dropout -> Add -> LN -> MLP -> Dropout -> Add.
105
+ [Ref: https://arxiv.org/abs/2002.04745]
106
+ Here we have: Dropout -> Add -> LN -> MHA -> Dropout -> Add -> LN -> MLP, returning both
107
+ the hidden_states (output of the MLP) and the residual.
108
+ This is for performance reasons, as we can fuse the dropout, add and LayerNorm.
109
+ The residual needs to be provided (except for the very first block).
110
+
111
+ For prenorm=False, this Block has the same structure as a regular postnorm Transformer
112
+ block: MHA -> Dropout -> Add -> LN -> MLP -> Dropout -> Add -> LN.
113
+
114
+ return_residual: whether each of the sub-layers (mixer and mlp) will return the residual.
115
+ This is for performance reason: for post-norm architecture, returning the input allows us
116
+ to fuse the backward of nn.Linear with the residual connection.
117
+ """
118
+ super().__init__()
119
+ self.prenorm = prenorm
120
+ self.fused_dropout_add_ln = fused_dropout_add_ln
121
+ self.return_residual = return_residual
122
+ self.residual_in_fp32 = residual_in_fp32
123
+ if self.residual_in_fp32:
124
+ assert self.prenorm, "residual_in_fp32 is only compatible with prenorm=True"
125
+ if mixer_cls is None:
126
+ mixer_cls = partial(MHA, num_heads=dim // 64)
127
+ if mlp_cls is None:
128
+ mlp_cls = partial(Mlp, hidden_features=4 * dim)
129
+ self.mixer = mixer_cls(dim)
130
+ self.dropout1 = dropout_cls(resid_dropout1)
131
+ self.drop_path1 = StochasticDepth(drop_path1, mode="row")
132
+ self.norm1 = norm_cls(dim)
133
+ self.mlp = mlp_cls(dim)
134
+ if not isinstance(self.mlp, nn.Identity):
135
+ self.dropout2 = dropout_cls(resid_dropout2)
136
+ self.drop_path2 = StochasticDepth(drop_path2, mode="row")
137
+ self.norm2 = norm_cls(dim)
138
+
139
+ if self.fused_dropout_add_ln:
140
+ assert layer_norm_fn is not None, "Triton is not installed"
141
+ assert isinstance(self.norm1, (nn.LayerNorm, RMSNorm)) and isinstance(
142
+ self.dropout1, nn.Dropout
143
+ )
144
+
145
+ # TD [2023-01-07]: TODO: During training, if sequence_parallel is False and dropout != 0.0,
146
+ # then the input to each worker in the tensor parallel group will be different.
147
+ # This would produce wrong outputs? Somehow we'd need to sync the RNG state across workers.
148
+ # For now this is not an issue because we always use sequence_parallel=True during training
149
+ # and only use sequence_parallel=False during inference.
150
+
151
+ # Mark the norm parameters as "sequence_parallel" so that we run all-reduce on their grads.
152
+ if sequence_parallel:
153
+ for p in self.norm1.parameters():
154
+ p._sequence_parallel = True
155
+ if hasattr(self, "norm2"):
156
+ for p in self.norm2.parameters():
157
+ p._sequence_parallel = True
158
+ # Mark the norm parameters as "shared_params" so that we sync their values at init.
159
+ if mark_shared_params:
160
+ for p in self.norm1.parameters():
161
+ p._shared_params = True
162
+ if hasattr(self, "norm2"):
163
+ for p in self.norm2.parameters():
164
+ p._shared_params = True
165
+
166
+ def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
167
+ return self.mixer.allocate_inference_cache(
168
+ batch_size, max_seqlen, dtype=dtype, **kwargs
169
+ )
170
+
171
+ def forward(
172
+ self,
173
+ hidden_states: Tensor,
174
+ residual: Optional[Tensor] = None,
175
+ mixer_subset=None,
176
+ mixer_kwargs=None,
177
+ ):
178
+ r"""Pass the input through the encoder layer.
179
+
180
+ Args:
181
+ hidden_states: the sequence to the encoder layer (required).
182
+ residual: if postnorm, residual=None, If prenorm, hidden_states = Attn/MLP(LN(residual))
183
+ mixer_subset: for cross-attention only. If not None, will take a subset of x
184
+ before applying the query projection. Useful for e.g., ViT where we only care
185
+ about the CLS token in the last layer.
186
+ """
187
+ if self.prenorm:
188
+ if not self.fused_dropout_add_ln:
189
+ dropped = self.drop_path1(self.dropout1(hidden_states))
190
+ residual = (dropped + residual) if residual is not None else dropped
191
+ hidden_states = self.norm1(residual.to(dtype=self.norm1.weight.dtype))
192
+ if self.residual_in_fp32:
193
+ residual = residual.to(torch.float32)
194
+ else:
195
+ if self.drop_path1.p == 0 or not self.training:
196
+ rowscale1 = None
197
+ else:
198
+ rowscale1 = self.drop_path1(
199
+ torch.ones(
200
+ hidden_states.shape[:-1],
201
+ device=hidden_states.device,
202
+ dtype=hidden_states.dtype,
203
+ )
204
+ )
205
+ hidden_states, residual = layer_norm_fn(
206
+ hidden_states,
207
+ self.norm1.weight,
208
+ self.norm1.bias,
209
+ residual=residual,
210
+ eps=self.norm1.eps,
211
+ dropout_p=self.dropout1.p if self.training else 0.0,
212
+ rowscale=rowscale1,
213
+ prenorm=True,
214
+ residual_in_fp32=self.residual_in_fp32,
215
+ is_rms_norm=isinstance(self.norm1, RMSNorm),
216
+ )
217
+ if mixer_kwargs is None:
218
+ mixer_kwargs = {}
219
+ if mixer_subset is not None:
220
+ mixer_kwargs["mixer_subset"] = mixer_subset
221
+ hidden_states = self.mixer(hidden_states, **mixer_kwargs)
222
+ if mixer_subset is not None:
223
+ residual = residual[:, mixer_subset]
224
+ if not isinstance(self.mlp, nn.Identity):
225
+ if not self.fused_dropout_add_ln:
226
+ dropped = self.drop_path2(self.dropout2(hidden_states))
227
+ residual = (dropped + residual) if residual is not None else dropped
228
+ hidden_states = self.norm2(
229
+ residual.to(dtype=self.norm2.weight.dtype)
230
+ )
231
+ if self.residual_in_fp32:
232
+ residual = residual.to(torch.float32)
233
+ else:
234
+ if self.drop_path2.p == 0 or not self.training:
235
+ rowscale2 = None
236
+ else:
237
+ rowscale2 = self.drop_path2(
238
+ torch.ones(
239
+ hidden_states.shape[:-1],
240
+ device=hidden_states.device,
241
+ dtype=hidden_states.dtype,
242
+ )
243
+ )
244
+ hidden_states, residual = layer_norm_fn(
245
+ hidden_states,
246
+ self.norm2.weight,
247
+ self.norm2.bias,
248
+ residual=residual,
249
+ eps=self.norm2.eps,
250
+ dropout_p=self.dropout2.p if self.training else 0.0,
251
+ rowscale=rowscale2,
252
+ prenorm=True,
253
+ residual_in_fp32=self.residual_in_fp32,
254
+ is_rms_norm=isinstance(self.norm2, RMSNorm),
255
+ )
256
+ hidden_states = self.mlp(hidden_states)
257
+ return hidden_states, residual
258
+ else:
259
+ assert residual is None
260
+ mixer_out = self.mixer(
261
+ hidden_states, **(mixer_kwargs if mixer_kwargs is not None else {})
262
+ )
263
+ if self.return_residual: # mixer out is actually a pair here
264
+ mixer_out, hidden_states = mixer_out
265
+ if not self.fused_dropout_add_ln:
266
+ hidden_states = self.norm1(
267
+ (self.drop_path1(self.dropout1(mixer_out)) + hidden_states).to(
268
+ dtype=self.norm1.weight.dtype
269
+ )
270
+ )
271
+ else:
272
+ if self.drop_path1.p == 0 or not self.training:
273
+ rowscale1 = None
274
+ else:
275
+ rowscale1 = self.drop_path1(
276
+ torch.ones(
277
+ mixer_out.shape[:-1],
278
+ device=mixer_out.device,
279
+ dtype=mixer_out.dtype,
280
+ )
281
+ )
282
+ hidden_states = layer_norm_fn(
283
+ mixer_out,
284
+ self.norm1.weight,
285
+ self.norm1.bias,
286
+ residual=hidden_states,
287
+ eps=self.norm1.eps,
288
+ dropout_p=self.dropout1.p if self.training else 0.0,
289
+ rowscale=rowscale1,
290
+ prenorm=False,
291
+ is_rms_norm=isinstance(self.norm1, RMSNorm),
292
+ )
293
+ if not isinstance(self.mlp, nn.Identity):
294
+ mlp_out = self.mlp(hidden_states)
295
+ if self.return_residual: # mlp out is actually a pair here
296
+ mlp_out, hidden_states = mlp_out
297
+ if not self.fused_dropout_add_ln:
298
+ hidden_states = self.norm2(
299
+ (self.drop_path2(self.dropout2(mlp_out)) + hidden_states).to(
300
+ dtype=self.norm2.weight.dtype
301
+ )
302
+ )
303
+ else:
304
+ if self.drop_path2.p == 0 or not self.training:
305
+ rowscale2 = None
306
+ else:
307
+ rowscale2 = self.drop_path2(
308
+ torch.ones(
309
+ mlp_out.shape[:-1],
310
+ device=mlp_out.device,
311
+ dtype=mlp_out.dtype,
312
+ )
313
+ )
314
+ hidden_states = layer_norm_fn(
315
+ mlp_out,
316
+ self.norm2.weight,
317
+ self.norm2.bias,
318
+ residual=hidden_states,
319
+ eps=self.norm2.eps,
320
+ dropout_p=self.dropout2.p if self.training else 0.0,
321
+ rowscale=rowscale2,
322
+ prenorm=False,
323
+ is_rms_norm=isinstance(self.norm2, RMSNorm),
324
+ )
325
+ return hidden_states
326
+
327
+
328
+ class ParallelBlock(nn.Module):
329
+ """The attention (mixer) and MLP blocks are done in parallel, similar to GPT-J, GPT-NeoX,
330
+ and PaLM.
331
+ """
332
+
333
+ def __init__(
334
+ self,
335
+ dim,
336
+ mixer_cls=None,
337
+ mlp_cls=None,
338
+ norm_cls=nn.LayerNorm,
339
+ dropout_cls=nn.Dropout,
340
+ resid_dropout1=0.0,
341
+ resid_dropout2=0.0,
342
+ tied_norm=False,
343
+ fused_dropout_add_ln=False,
344
+ residual_in_fp32=False,
345
+ sequence_parallel=False,
346
+ mark_shared_params=False,
347
+ ):
348
+ """
349
+ This Block has a slightly different structure compared to a regular
350
+ prenorm Transformer block.
351
+ The standard block is: LN -> MHA / MLP -> Dropout -> Add.
352
+ [Ref: https://arxiv.org/abs/2002.04745]
353
+ Here we have: Dropout -> Add -> LN -> MHA / MLP, returning both
354
+ the hidden_states (output1 of the MHA / MLP) and the residual.
355
+ This is for performance reasons, as we can fuse the dropout, add and LayerNorm.
356
+ The residual needs to be provided (except for the very first block).
357
+ """
358
+ super().__init__()
359
+ self.tied_norm = tied_norm
360
+ self.fused_dropout_add_ln = fused_dropout_add_ln
361
+ self.residual_in_fp32 = residual_in_fp32
362
+ if mixer_cls is None:
363
+ mixer_cls = partial(MHA, num_heads=dim // 64)
364
+ if mlp_cls is None:
365
+ mlp_cls = partial(Mlp, hidden_features=4 * dim)
366
+ self.mixer = mixer_cls(dim)
367
+ self.dropout1 = dropout_cls(resid_dropout1)
368
+ self.norm1 = norm_cls(dim)
369
+ self.mlp = mlp_cls(dim)
370
+ self.dropout2 = dropout_cls(resid_dropout2)
371
+ if not self.tied_norm:
372
+ self.norm2 = norm_cls(dim)
373
+
374
+ if self.fused_dropout_add_ln:
375
+ assert layer_norm_fn is not None, "Triton is not installed"
376
+ assert isinstance(self.norm1, (nn.LayerNorm, RMSNorm)) and isinstance(
377
+ self.dropout1, nn.Dropout
378
+ )
379
+
380
+ # TD [2023-01-07]: TODO: During training, if sequence_parallel is False and dropout != 0.0,
381
+ # then the input to each worker in the tensor parallel group will be different.
382
+ # This would produce wrong outputs? Somehow we'd need to sync the RNG state across workers.
383
+ # For now this is not an issue because we always use sequence_parallel=True during training
384
+ # and only use sequence_parallel=False during inference.
385
+
386
+ # Mark the norm parameters as "sequence_parallel" so that we run all-reduce on their grads.
387
+ if sequence_parallel:
388
+ for p in self.norm1.parameters():
389
+ p._sequence_parallel = True
390
+ if hasattr(self, "norm2"):
391
+ for p in self.norm2.parameters():
392
+ p._sequence_parallel = True
393
+ # Mark the norm parameters as "shared_params" so that we sync their values at init.
394
+ if mark_shared_params:
395
+ for p in self.norm1.parameters():
396
+ p._shared_params = True
397
+ if hasattr(self, "norm2"):
398
+ for p in self.norm2.parameters():
399
+ p._shared_params = True
400
+
401
+ def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
402
+ return self.mixer.allocate_inference_cache(
403
+ batch_size, max_seqlen, dtype=dtype, **kwargs
404
+ )
405
+
406
+ def forward(
407
+ self,
408
+ hidden_states1: Tensor,
409
+ hidden_states2: Optional[Tensor] = None,
410
+ residual: Optional[Tensor] = None,
411
+ mixer_kwargs=None,
412
+ ):
413
+ r"""Pass the input through the encoder layer.
414
+
415
+ Args:
416
+ hidden_states1: the output of the previous attention (mixer) or embedding layer.
417
+ hidden_states2: the output of the previous MLP layer (if None, will use hidden_states1).
418
+ residual: the residual tensor carried over from the previous block (None for the very first block).
419
+ """
420
+ # TODO: Ideally we should only do the allgather / allreduce once for
421
+ # the Linear to MLP & Attention
422
+ if not self.fused_dropout_add_ln:
423
+ dropped1 = self.dropout1(hidden_states1)
424
+ # For the very 1st block, we only want 1 dropout, not two different dropouts
425
+ if hidden_states2 is not None:
426
+ dropped2 = self.dropout2(hidden_states2)
427
+ residual = (
428
+ (residual + dropped1 + dropped2)
429
+ if residual is not None
430
+ else dropped1 + dropped2
431
+ )
432
+ else:
433
+ residual = (residual + dropped1) if residual is not None else dropped1
434
+ hidden_states1 = self.norm1(residual.to(dtype=self.norm1.weight.dtype))
435
+ hidden_states2 = (
436
+ self.norm2(residual.to(dtype=self.norm2.weight.dtype))
437
+ if not self.tied_norm
438
+ else hidden_states1
439
+ )
440
+ if self.residual_in_fp32:
441
+ residual = residual.to(torch.float32)
442
+ else:
443
+ weight2, bias2 = (
444
+ (self.norm2.weight, self.norm2.bias)
445
+ if not self.tied_norm
446
+ else (None, None)
447
+ )
448
+ hidden_states1, *rest, residual = layer_norm_fn(
449
+ hidden_states1,
450
+ self.norm1.weight,
451
+ self.norm1.bias,
452
+ residual=residual,
453
+ x1=hidden_states2,
454
+ weight1=weight2,
455
+ bias1=bias2,
456
+ eps=self.norm1.eps,
457
+ dropout_p=self.dropout1.p if self.training else 0.0,
458
+ prenorm=True,
459
+ residual_in_fp32=self.residual_in_fp32,
460
+ is_rms_norm=isinstance(self.norm1, RMSNorm),
461
+ )
462
+ if self.tied_norm:
463
+ hidden_states2 = hidden_states1
464
+ else:
465
+ (hidden_states2,) = rest
466
+ if mixer_kwargs is None:
467
+ mixer_kwargs = {}
468
+ hidden_states1 = self.mixer(hidden_states1, **mixer_kwargs)
469
+ hidden_states2 = self.mlp(hidden_states2)
470
+ return hidden_states1, hidden_states2, residual
config.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "architectures": [
3
+ "XLMRobertaForSequenceClassification"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "auto_map": {
7
+ "AutoConfig": "configuration_xlm_roberta.XLMRobertaFlashConfig",
8
+ "AutoModel": "modeling_xlm_roberta.XLMRobertaModel",
9
+ "AutoModelForSequenceClassification": "modeling_xlm_roberta.XLMRobertaForSequenceClassification"
10
+ },
11
+ "bos_token_id": 0,
12
+ "classifier_dropout": null,
13
+ "dtype": "bfloat16",
14
+ "emb_pooler": null,
15
+ "eos_token_id": 2,
16
+ "hidden_act": "gelu",
17
+ "hidden_dropout_prob": 0.1,
18
+ "hidden_size": 768,
19
+ "id2label": {
20
+ "0": "LABEL_0"
21
+ },
22
+ "initializer_range": 0.02,
23
+ "intermediate_size": 3072,
24
+ "label2id": {
25
+ "LABEL_0": 0
26
+ },
27
+ "layer_norm_eps": 1e-05,
28
+ "load_trained_adapters": false,
29
+ "lora_adaptations": null,
30
+ "lora_alpha": 1,
31
+ "lora_dropout_p": 0.0,
32
+ "lora_main_params_trainable": false,
33
+ "lora_rank": 4,
34
+ "matryoshka_dimensions": null,
35
+ "max_position_embeddings": 1026,
36
+ "num_attention_heads": 12,
37
+ "num_hidden_layers": 12,
38
+ "output_past": true,
39
+ "pad_token_id": 1,
40
+ "position_embedding_type": "absolute",
41
+ "sentence_transformers": {
42
+ "activation_fn": "torch.nn.modules.activation.Sigmoid",
43
+ "version": "5.0.0"
44
+ },
45
+ "transformers_version": "4.56.0",
46
+ "truncate_dim": null,
47
+ "type_vocab_size": 1,
48
+ "use_cache": false,
49
+ "use_flash_attn": true,
50
+ "vocab_size": 250002
51
+ }
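The `auto_map` entries above point at the custom configuration and modeling files shipped in this repository, so loading through the plain transformers auto classes requires `trust_remote_code=True`. An illustrative, untested sketch (the `CrossEncoder` usage in the README is the supported path):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "cuadron11/jina-reranker-v2-base-multilingual-contrastive-parl-4-10ep"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(
    repo, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

# Score one (query, passage) pair; the sentence_transformers activation_fn is Sigmoid,
# so the single logit is squashed to a 0-1 relevance score here as well.
query = "Zer gertatu zen martxoaren 3an Euskal Autonomia Erkidegoan?"
passage = "..."  # candidate passage text
inputs = tokenizer(query, passage, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    score = torch.sigmoid(model(**inputs).logits.squeeze().float())
```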
configuration_xlm_roberta.py ADDED
@@ -0,0 +1,69 @@
+ from transformers import PretrainedConfig
+ import torch
+ 
+ class XLMRobertaFlashConfig(PretrainedConfig):
+     def __init__(
+         self,
+         vocab_size=30522,
+         hidden_size=768,
+         num_hidden_layers=12,
+         num_attention_heads=12,
+         intermediate_size=3072,
+         hidden_act="gelu",
+         hidden_dropout_prob=0.1,
+         attention_probs_dropout_prob=0.1,
+         max_position_embeddings=512,
+         type_vocab_size=2,
+         initializer_range=0.02,
+         layer_norm_eps=1e-12,
+         pad_token_id=1,
+         bos_token_id=0,
+         eos_token_id=2,
+         position_embedding_type="absolute",
+         use_cache=True,
+         classifier_dropout=None,
+         lora_adaptations=None,
+         lora_rank=4,
+         lora_dropout_p=0.0,
+         lora_alpha=1,
+         lora_main_params_trainable=False,
+         load_trained_adapters=False,
+         use_flash_attn=True,
+         torch_dtype=None,
+         emb_pooler=None,
+         matryoshka_dimensions=None,
+         truncate_dim=None,
+         **kwargs,
+     ):
+         super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
+ 
+ 
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.hidden_act = hidden_act
+         self.intermediate_size = intermediate_size
+         self.hidden_dropout_prob = hidden_dropout_prob
+         self.attention_probs_dropout_prob = attention_probs_dropout_prob
+         self.max_position_embeddings = max_position_embeddings
+         self.type_vocab_size = type_vocab_size
+         self.initializer_range = initializer_range
+         self.layer_norm_eps = layer_norm_eps
+         self.position_embedding_type = position_embedding_type
+         self.use_cache = use_cache
+         self.classifier_dropout = classifier_dropout
+         self.load_trained_adapters = load_trained_adapters
+         self.lora_adaptations = lora_adaptations
+         self.lora_rank = lora_rank
+         self.lora_dropout_p = lora_dropout_p
+         self.lora_alpha = lora_alpha
+         self.lora_main_params_trainable = lora_main_params_trainable
+         self.use_flash_attn = use_flash_attn
+         self.emb_pooler = emb_pooler
+         self.matryoshka_dimensions = matryoshka_dimensions
+         self.truncate_dim = truncate_dim
+         if torch_dtype and hasattr(torch, torch_dtype) and type(getattr(torch, torch_dtype)) is torch.dtype:
+             self.torch_dtype = getattr(torch, torch_dtype)
+         else:
+             self.torch_dtype = torch_dtype
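The config class above mirrors the stock XLM-RoBERTa configuration and adds the LoRA, pooling, Matryoshka, and FlashAttention fields used by the custom modeling code; a string `torch_dtype` is resolved to the corresponding `torch.dtype`. A small sketch of instantiating it directly, assuming `configuration_xlm_roberta.py` is importable as a local module:

```python
import torch
from configuration_xlm_roberta import XLMRobertaFlashConfig

cfg = XLMRobertaFlashConfig(
    vocab_size=250002,
    max_position_embeddings=1026,
    type_vocab_size=1,
    layer_norm_eps=1e-5,
    use_flash_attn=True,
    torch_dtype="bfloat16",  # string form, as stored in config.json
)
print(cfg.torch_dtype)  # torch.bfloat16 — resolved from the string by the constructor above
```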
embedding.py ADDED
@@ -0,0 +1,62 @@
+ # This implementation was adapted from https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/embedding.py
+ # Commit id: f1a73d074002226c42ce65a1df170ecff9f022c0
+ 
+ # Copyright (c) 2022, Tri Dao.
+ 
+ import torch
+ import torch.nn as nn
+ from einops import rearrange
+ from torch import Tensor
+ 
+ from transformers.models.xlm_roberta.modeling_xlm_roberta import create_position_ids_from_input_ids
+ 
+ 
+ class XLMRobertaEmbeddings(nn.Module):
+     def __init__(
+         self,
+         embed_dim,
+         vocab_size,
+         max_position_embeddings,
+         type_vocab_size,
+         padding_idx=None,
+         device=None,
+         dtype=None,
+     ):
+         """
+         If max_position_embeddings <= 0, there's no position embeddings
+         If type_vocab_size <= 0, there's no token type embeddings
+         """
+         factory_kwargs = {"device": device, "dtype": dtype}
+         super().__init__()
+         self.word_embeddings = nn.Embedding(
+             vocab_size, embed_dim, padding_idx=padding_idx, **factory_kwargs
+         )
+         self.max_position_embeddings = max_position_embeddings
+         self.type_vocab_size = type_vocab_size
+         if self.max_position_embeddings > 0:
+             self.position_embeddings = nn.Embedding(
+                 max_position_embeddings, embed_dim, **factory_kwargs
+             )
+         if self.type_vocab_size > 0:
+             self.token_type_embeddings = nn.Embedding(type_vocab_size, embed_dim, **factory_kwargs)
+ 
+     def forward(self, input_ids, position_ids=None, token_type_ids=None):
+         """
+         input_ids: (batch, seqlen)
+         position_ids: (batch, seqlen)
+         token_type_ids: (batch, seqlen)
+         """
+         batch_size, seqlen = input_ids.shape
+         embeddings = self.word_embeddings(input_ids)
+         if self.max_position_embeddings > 0:
+             if position_ids is None:
+                 position_ids = create_position_ids_from_input_ids(input_ids, padding_idx=self.word_embeddings.padding_idx).to(input_ids.device)
+                 # position_ids = torch.arange(seqlen, dtype=torch.long, device=input_ids.device)
+             position_embeddings = self.position_embeddings(position_ids)
+             embeddings = embeddings + position_embeddings
+         if self.type_vocab_size > 0:
+             if token_type_ids is None:
+                 token_type_ids = torch.zeros(seqlen, dtype=torch.long, device=input_ids.device)
+             token_type_embeddings = self.token_type_embeddings(token_type_ids)
+             embeddings = embeddings + token_type_embeddings
+         return embeddings
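The position ids above follow the RoBERTa convention: padding positions keep `padding_idx`, and real tokens are numbered starting at `padding_idx + 1`. A quick check of that convention with the same helper the module imports:

```python
import torch
from transformers.models.xlm_roberta.modeling_xlm_roberta import create_position_ids_from_input_ids

pad = 1  # pad_token_id in this model's config
input_ids = torch.tensor([[0, 100, 101, 2, pad, pad]])  # <s> tok tok </s> <pad> <pad>
print(create_position_ids_from_input_ids(input_ids, padding_idx=pad))
# tensor([[2, 3, 4, 5, 1, 1]]) — non-pad tokens count up from padding_idx + 1
```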
mha.py ADDED
@@ -0,0 +1,662 @@
1
+ # Copyright (c) 2023, Tri Dao.
2
+ # Adapted from https://github.com/Dao-AILab/flash-attention/pull/556
3
+
4
+ import math
5
+ from functools import partial
6
+
7
+ import torch
8
+ import torch.nn as nn
9
+ from einops import rearrange, repeat
10
+
11
+ try:
12
+ from flash_attn import (
13
+ flash_attn_kvpacked_func,
14
+ flash_attn_qkvpacked_func,
15
+ flash_attn_varlen_kvpacked_func,
16
+ flash_attn_varlen_qkvpacked_func,
17
+ flash_attn_with_kvcache,
18
+ )
19
+ except ImportError:
20
+ flash_attn_varlen_qkvpacked_func, flash_attn_varlen_kvpacked_func = None, None
21
+ flash_attn_qkvpacked_func, flash_attn_kvpacked_func = None, None
22
+ flash_attn_with_kvcache = None
23
+
24
+ try:
25
+ from flash_attn.ops.fused_dense import ColumnParallelLinear, FusedDense, RowParallelLinear
26
+ except ImportError:
27
+ FusedDense, ColumnParallelLinear, RowParallelLinear = None, None, None
28
+
29
+
30
+ class FlashSelfAttention(nn.Module):
31
+ """Implement the scaled dot product attention with softmax.
32
+ Arguments
33
+ ---------
34
+ softmax_scale: The temperature to use for the softmax attention.
35
+ (default: 1/sqrt(d_keys) where d_keys is computed at
36
+ runtime)
37
+ attention_dropout: The dropout rate to apply to the attention
38
+ (default: 0.0)
39
+ """
40
+
41
+ def __init__(
42
+ self,
43
+ causal=False,
44
+ softmax_scale=None,
45
+ attention_dropout=0.0,
46
+ window_size=(-1, -1),
47
+ deterministic=False,
48
+ ):
49
+ super().__init__()
50
+ assert flash_attn_varlen_qkvpacked_func is not None, "FlashAttention is not installed"
51
+ assert flash_attn_qkvpacked_func is not None, "FlashAttention is not installed"
52
+ self.causal = causal
53
+ self.softmax_scale = softmax_scale
54
+ self.drop = nn.Dropout(attention_dropout)
55
+ self.window_size = window_size
56
+ self.deterministic = deterministic
57
+
58
+ def forward(self, qkv, causal=None, cu_seqlens=None, max_seqlen=None):
59
+ """Implements the multihead softmax attention.
60
+ Arguments
61
+ ---------
62
+ qkv: The tensor containing the query, key, and value.
63
+ If cu_seqlens is None and max_seqlen is None, then qkv has shape (B, S, 3, H, D).
64
+ If cu_seqlens is not None and max_seqlen is not None, then qkv has shape
65
+ (total, 3, H, D), where total is the sum of the sequence lengths in the batch.
66
+ causal: if passed, will override self.causal
67
+ cu_seqlens: (batch_size + 1,), dtype torch.int32. The cumulative sequence lengths
68
+ of the sequences in the batch, used to index into qkv.
69
+ max_seqlen: int. Maximum sequence length in the batch.
70
+ Returns:
71
+ --------
72
+ out: (total, H, D) if cu_seqlens is not None and max_seqlen is not None,
73
+ else (B, S, H, D).
74
+ """
75
+ assert qkv.dtype in [torch.float16, torch.bfloat16]
76
+ assert qkv.is_cuda
77
+ causal = self.causal if causal is None else causal
78
+ unpadded = cu_seqlens is not None
79
+
80
+ if unpadded:
81
+ assert cu_seqlens.dtype == torch.int32
82
+ assert max_seqlen is not None
83
+ assert isinstance(max_seqlen, int)
84
+ return flash_attn_varlen_qkvpacked_func(
85
+ qkv,
86
+ cu_seqlens,
87
+ max_seqlen,
88
+ self.drop.p if self.training else 0.0,
89
+ softmax_scale=self.softmax_scale,
90
+ causal=causal,
91
+ alibi_slopes=None,
92
+ window_size=self.window_size,
93
+ deterministic=self.deterministic,
94
+ )
95
+ else:
96
+ return flash_attn_qkvpacked_func(
97
+ qkv,
98
+ self.drop.p if self.training else 0.0,
99
+ softmax_scale=self.softmax_scale,
100
+ causal=causal,
101
+ alibi_slopes=None,
102
+ window_size=self.window_size,
103
+ deterministic=self.deterministic,
104
+ )
105
+
106
+
107
+ class FlashCrossAttention(nn.Module):
108
+ """Implement the scaled dot product attention with softmax.
109
+ Arguments
110
+ ---------
111
+ softmax_scale: The temperature to use for the softmax attention.
112
+ (default: 1/sqrt(d_keys) where d_keys is computed at
113
+ runtime)
114
+ attention_dropout: The dropout rate to apply to the attention
115
+ (default: 0.0)
116
+ """
117
+
118
+ def __init__(
119
+ self,
120
+ causal=False,
121
+ softmax_scale=None,
122
+ attention_dropout=0.0,
123
+ window_size=(-1, -1),
124
+ deterministic=False,
125
+ ):
126
+ super().__init__()
127
+ assert flash_attn_varlen_kvpacked_func is not None, "FlashAttention is not installed"
128
+ assert flash_attn_kvpacked_func is not None, "FlashAttention is not installed"
129
+ self.causal = causal
130
+ self.softmax_scale = softmax_scale
131
+ self.drop = nn.Dropout(attention_dropout)
132
+ self.window_size = window_size
133
+ self.deterministic = deterministic
134
+
135
+ def forward(
136
+ self,
137
+ q,
138
+ kv,
139
+ causal=None,
140
+ cu_seqlens=None,
141
+ max_seqlen=None,
142
+ cu_seqlens_k=None,
143
+ max_seqlen_k=None,
144
+ ):
145
+ """Implements the multihead softmax attention.
146
+ Arguments
147
+ ---------
148
+ q: The tensor containing the query. (B, Sq, H, D)
149
+ kv: The tensor containing the key and value. (B, Sk, 2, H_k, D)
150
+ causal: if passed, will override self.causal
151
+ cu_seqlens: (batch_size + 1,), dtype torch.int32. The cumulative sequence lengths
152
+ of the sequences in the batch, used to index into q.
153
+ max_seqlen: int. Maximum sequence length in the batch of q.
154
+ cu_seqlens_k: (batch_size + 1,), dtype torch.int32. The cumulative sequence lengths
155
+ of the sequences in the batch, used to index into kv.
156
+ max_seqlen_k: int. Maximum sequence length in the batch of k and v.
157
+ """
158
+ assert q.dtype in [torch.float16, torch.bfloat16]
159
+ assert q.is_cuda and kv.is_cuda
160
+ causal = self.causal if causal is None else causal
161
+ unpadded = cu_seqlens is not None
162
+
163
+ if unpadded:
164
+ assert cu_seqlens.dtype == torch.int32
165
+ assert max_seqlen is not None
166
+ assert isinstance(max_seqlen, int)
167
+ assert cu_seqlens_k is not None
168
+ assert cu_seqlens_k.dtype == torch.int32
169
+ assert max_seqlen_k is not None
170
+ assert isinstance(max_seqlen, int)
171
+ return flash_attn_varlen_kvpacked_func(
172
+ q,
173
+ kv,
174
+ cu_seqlens,
175
+ cu_seqlens_k,
176
+ max_seqlen,
177
+ max_seqlen_k,
178
+ self.drop.p if self.training else 0.0,
179
+ softmax_scale=self.softmax_scale,
180
+ causal=causal,
181
+ alibi_slopes=None,
182
+ window_size=self.window_size,
183
+ deterministic=self.deterministic,
184
+ )
185
+ else:
186
+ batch_size, seqlen_q = q.shape[0], q.shape[1]
187
+ seqlen_k = kv.shape[1]
188
+ assert kv.shape[0] == batch_size and kv.shape[4] == q.shape[3]
189
+ return flash_attn_kvpacked_func(
190
+ q,
191
+ kv,
192
+ self.drop.p if self.training else 0.0,
193
+ causal=causal,
194
+ softmax_scale=self.softmax_scale,
195
+ alibi_slopes=None,
196
+ window_size=self.window_size,
197
+ deterministic=self.deterministic,
198
+ )
199
+
200
+
201
+ class SelfAttention(nn.Module):
202
+ """Implement the scaled dot product attention with softmax.
203
+ Arguments
204
+ ---------
205
+ softmax_scale: The temperature to use for the softmax attention.
206
+ (default: 1/sqrt(d_keys) where d_keys is computed at
207
+ runtime)
208
+ attention_dropout: The dropout rate to apply to the attention
209
+ (default: 0.0)
210
+ """
211
+
212
+ def __init__(self, causal=False, softmax_scale=None, attention_dropout=0.0):
213
+ super().__init__()
214
+ self.causal = causal
215
+ self.softmax_scale = softmax_scale
216
+ self.drop = nn.Dropout(attention_dropout)
217
+
218
+ def forward(self, qkv, causal=None, key_padding_mask=None):
219
+ """Implements the multihead softmax attention.
220
+ Arguments
221
+ ---------
222
+ qkv: The tensor containing the query, key, and value. (B, S, 3, H, D)
223
+ causal: if passed, will override self.causal
224
+ key_padding_mask: boolean mask to apply to the attention weights. True means to keep,
225
+ False means to mask out. (B, S)
226
+ """
227
+ batch_size, seqlen = qkv.shape[0], qkv.shape[1]
228
+ causal = self.causal if causal is None else causal
229
+ q, k, v = qkv.unbind(dim=2)
230
+ softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
231
+ scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)
232
+ if key_padding_mask is not None:
233
+ padding_mask = torch.full(
234
+ (batch_size, seqlen), -10000.0, dtype=scores.dtype, device=scores.device
235
+ )
236
+ padding_mask.masked_fill_(key_padding_mask, 0.0)
237
+ # TD [2022-09-30]: Adding is faster than masked_fill_ (idk why, just better kernel I guess)
238
+ scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")
239
+ if causal:
240
+ # "triu_tril_cuda_template" not implemented for 'BFloat16'
241
+ # So we have to construct the mask in float
242
+ causal_mask = torch.triu(
243
+ torch.full((seqlen, seqlen), -10000.0, device=scores.device), 1
244
+ )
245
+ # TD [2022-09-30]: Adding is faster than masked_fill_ (idk why, just better kernel I guess)
246
+ scores = scores + causal_mask.to(dtype=scores.dtype)
247
+ attention = torch.softmax(scores, dim=-1, dtype=v.dtype)
248
+ attention_drop = self.drop(attention)
249
+ output = torch.einsum("bhts,bshd->bthd", attention_drop, v)
250
+ return output
251
+
252
+
253
+ class CrossAttention(nn.Module):
254
+ """Implement the scaled dot product attention with softmax.
255
+ Arguments
256
+ ---------
257
+ softmax_scale: The temperature to use for the softmax attention.
258
+ (default: 1/sqrt(d_keys) where d_keys is computed at
259
+ runtime)
260
+ attention_dropout: The dropout rate to apply to the attention
261
+ (default: 0.0)
262
+ """
263
+
264
+ def __init__(self, causal=False, softmax_scale=None, attention_dropout=0.0):
265
+ super().__init__()
266
+ self.causal = causal
267
+ self.softmax_scale = softmax_scale
268
+ self.drop = nn.Dropout(attention_dropout)
269
+
270
+ def forward(self, q, kv, causal=None, key_padding_mask=None):
271
+ """Implements the multihead softmax attention.
272
+ Arguments
273
+ ---------
274
+ q: The tensor containing the query. (B, Sq, H, D)
275
+ kv: The tensor containing the key and value. (B, Sk, 2, H_k, D)
276
+ causal: if passed, will override self.causal
277
+ key_padding_mask: boolean mask to apply to the attention weights. True means to keep,
278
+ False means to mask out. (B, Sk)
279
+ """
280
+ batch_size, seqlen_q = q.shape[0], q.shape[1]
281
+ causal = self.causal if causal is None else causal
282
+ seqlen_k = kv.shape[1]
283
+ assert kv.shape[0] == batch_size and kv.shape[4] == q.shape[3]
284
+ if kv.shape[3] != q.shape[2]: # MQA/GQA
285
+ kv = repeat(kv, "... hkv d -> ... (hkv g) d", g=q.shape[2] // kv.shape[3])
286
+ k, v = kv.unbind(dim=2)
287
+ softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
288
+ scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)
289
+ if key_padding_mask is not None:
290
+ padding_mask = torch.full(
291
+ (batch_size, seqlen_k), -10000.0, dtype=scores.dtype, device=scores.device
292
+ )
293
+ padding_mask.masked_fill_(key_padding_mask, 0.0)
294
+ # TD [2022-09-30]: Adding is faster than masked_fill_ (idk why, just better kernel I guess)
295
+ scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")
296
+ if causal:
297
+ # causal mask needs to take into account the difference between seqlen_q and seqlen_k
298
+ row_idx = rearrange(
299
+ torch.arange(seqlen_q, device=q.device, dtype=torch.long), "s -> s 1"
300
+ )
301
+ col_idx = torch.arange(seqlen_k, device=kv.device, dtype=torch.long)
302
+ sk = (
303
+ seqlen_k
304
+ if key_padding_mask is None
305
+ else rearrange(key_padding_mask.sum(-1), "b -> b 1 1 1")
306
+ )
307
+ causal_mask = col_idx > row_idx + sk - seqlen_q
308
+ scores = scores.masked_fill(causal_mask, -10000.0)
309
+ attention = torch.softmax(scores, dim=-1, dtype=v.dtype)
310
+ attention_drop = self.drop(attention)
311
+ output = torch.einsum("bhts,bshd->bthd", attention_drop, v)
312
+ return output
313
+
314
+
315
+ class LinearResidual(nn.Linear):
316
+ """Wrap nn.Linear to return the residual as well. For compatibility with FusedDense."""
317
+
318
+ def forward(self, input: torch.Tensor) -> torch.Tensor:
319
+ return super().forward(input), input
320
+
321
+
322
+ def _update_kv_cache(kv, inference_params, layer_idx):
323
+ """kv: (batch_size, seqlen, 2, nheads, head_dim) or (batch_size, 1, 2, nheads, head_dim)"""
324
+ # Pre-allocate memory for key-values for inference.
325
+ num_heads, head_dim = kv.shape[-2:]
326
+ if layer_idx not in inference_params.key_value_memory_dict:
327
+ kv_cache = torch.empty(
328
+ inference_params.max_batch_size,
329
+ inference_params.max_seqlen,
330
+ 2,
331
+ num_heads,
332
+ head_dim,
333
+ dtype=kv.dtype,
334
+ device=kv.device,
335
+ )
336
+ inference_params.key_value_memory_dict[layer_idx] = kv_cache
337
+ else:
338
+ kv_cache = inference_params.key_value_memory_dict[layer_idx]
339
+ # Adjust key and value for inference
340
+ batch_start = inference_params.batch_size_offset
341
+ batch_end = batch_start + kv.shape[0]
342
+ sequence_start = inference_params.seqlen_offset
343
+ sequence_end = sequence_start + kv.shape[1]
344
+ assert batch_end <= kv_cache.shape[0]
345
+ assert sequence_end <= kv_cache.shape[1]
346
+ assert kv_cache is not None
347
+ kv_cache[batch_start:batch_end, sequence_start:sequence_end, ...] = kv
348
+ return kv_cache[batch_start:batch_end, :sequence_end, ...]
349
+
350
+
351
+ class MHA(nn.Module):
352
+ """Multi-head self-attention and cross-attention"""
353
+
354
+ def __init__(
355
+ self,
356
+ embed_dim,
357
+ num_heads,
358
+ num_heads_kv=None,
359
+ cross_attn=False,
360
+ qkv_proj_bias=True,
361
+ out_proj_bias=True,
362
+ dropout=0.0,
363
+ softmax_scale=None,
364
+ causal=False,
365
+ layer_idx=None,
366
+ dwconv=False,
367
+ window_size=(-1, -1),
368
+ fused_bias_fc=False,
369
+ use_flash_attn=False,
370
+ return_residual=False,
371
+ checkpointing=False,
372
+ device=None,
373
+ dtype=None,
374
+ ) -> None:
375
+ """
376
+ num_heads_kv: can be used to toggle MQA / GQA. If None, use num_heads.
377
+ return_residual: whether to return the input x along with the output. This is for
378
+ performance reason: for post-norm architecture, returning the input allows us
379
+ to fuse the backward of nn.Linear with the residual connection.
380
+ """
381
+ factory_kwargs = {"device": device, "dtype": dtype}
382
+ super().__init__()
383
+ self.embed_dim = embed_dim
384
+ self.cross_attn = cross_attn
385
+ self.causal = causal
386
+ self.layer_idx = layer_idx
387
+ self.dwconv = dwconv
388
+ self.use_flash_attn = use_flash_attn
389
+ self.return_residual = return_residual
390
+ self.checkpointing = checkpointing
391
+
392
+ if window_size != (-1, -1):
393
+ assert use_flash_attn, "Local (sliding window) attention code path requires flash_attn"
394
+
395
+ self.num_heads = num_heads
396
+ self.num_heads_kv = num_heads_kv if num_heads_kv is not None else num_heads
397
+ assert (
398
+ self.num_heads % self.num_heads_kv == 0
399
+ ), "num_heads must be divisible by num_heads_kv"
400
+ assert self.embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
401
+ self.head_dim = self.embed_dim // num_heads
402
+ qkv_dim = self.head_dim * (self.num_heads + 2 * self.num_heads_kv)
403
+ kv_dim = 2 * self.head_dim * self.num_heads_kv
404
+
405
+ if fused_bias_fc and FusedDense is None:
406
+ raise ImportError("fused_dense is not installed")
407
+ linear_cls = nn.Linear if not fused_bias_fc else FusedDense
408
+ linear_resid_cls = (
409
+ LinearResidual if not fused_bias_fc else partial(FusedDense, return_residual=True)
410
+ )
411
+ wqkv_cls = linear_cls if not self.return_residual else linear_resid_cls
412
+ inner_attn_cls = (
413
+ partial(FlashSelfAttention, window_size=window_size)
414
+ if use_flash_attn
415
+ else SelfAttention
416
+ )
417
+ inner_cross_attn_cls = (
418
+ partial(FlashCrossAttention, window_size=window_size)
419
+ if use_flash_attn
420
+ else CrossAttention
421
+ )
422
+ if not self.cross_attn:
423
+ self.Wqkv = wqkv_cls(embed_dim, qkv_dim, bias=qkv_proj_bias, **factory_kwargs)
424
+ else:
425
+ self.Wq = linear_cls(embed_dim, embed_dim, bias=qkv_proj_bias, **factory_kwargs)
426
+ self.Wkv = wqkv_cls(embed_dim, kv_dim, bias=qkv_proj_bias, **factory_kwargs)
427
+ if self.dwconv:
428
+ if self.num_heads_kv == self.num_heads:
429
+ self.dwconv_qkv = nn.Conv1d(
430
+ qkv_dim, qkv_dim, kernel_size=3, padding=2, groups=qkv_dim
431
+ )
432
+ else:
433
+ self.dwconv_q = nn.Conv1d(
434
+ embed_dim, embed_dim, kernel_size=3, padding=2, groups=embed_dim
435
+ )
436
+ self.dwconv_kv = nn.Conv1d(kv_dim, kv_dim, kernel_size=3, padding=2, groups=kv_dim)
437
+ self.inner_attn = inner_attn_cls(
438
+ causal=causal,
439
+ softmax_scale=softmax_scale,
440
+ attention_dropout=dropout,
441
+ )
442
+ self.inner_cross_attn = inner_cross_attn_cls(
443
+ causal=causal, softmax_scale=softmax_scale, attention_dropout=dropout
444
+ )
445
+ self.out_proj = linear_cls(embed_dim, embed_dim, bias=out_proj_bias, **factory_kwargs)
446
+
447
+ def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None):
448
+ dtype = self.out_proj.weight.dtype if dtype is None else dtype
449
+ device = self.out_proj.weight.device
450
+ return torch.empty(
451
+ batch_size,
452
+ max_seqlen,
453
+ 2,
454
+ self.num_heads_kv,
455
+ self.head_dim,
456
+ dtype=dtype,
457
+ device=device,
458
+ )
459
+
460
+ def _update_kv_cache(self, kv, inference_params):
461
+ """kv: (batch_size, seqlen, 2, nheads, head_dim) or (batch_size, 1, 2, nheads, head_dim)"""
462
+ assert not self.dwconv, "Generation does not support dwconv yet"
463
+ assert self.layer_idx is not None, "Generation requires layer_idx in the constructor"
464
+ return _update_kv_cache(kv, inference_params, self.layer_idx)
465
+
466
+ def _apply_rotary_update_kvcache_attention(self, q, kv, inference_params):
467
+ """
468
+ Fast path that combine 3 steps: apply rotary to Q and K, update kv cache, and apply attention.
469
+ q: (batch_size, seqlen_q, nheads, head_dim)
470
+ kv: (batch_size, seqlen_k, 2, nheads_kv, head_dim)
471
+ """
472
+ assert inference_params is not None and inference_params.seqlen_offset > 0
473
+ assert self.use_flash_attn
474
+ batch = q.shape[0]
475
+ kv_cache = inference_params.key_value_memory_dict[self.layer_idx][:batch]
476
+ cache_seqlens = (
477
+ inference_params.lengths_per_sample[:batch]
478
+ if inference_params.lengths_per_sample is not None
479
+ else inference_params.seqlen_offset
480
+ )
481
+ context = flash_attn_with_kvcache(
482
+ q,
483
+ kv_cache[:, :, 0],
484
+ kv_cache[:, :, 1],
485
+ kv[:, :, 0],
486
+ kv[:, :, 1],
487
+ cache_seqlens=cache_seqlens,
488
+ softmax_scale=self.inner_cross_attn.softmax_scale,
489
+ causal=self.inner_cross_attn.causal,
490
+ rotary_interleaved=False,
491
+ alibi_slopes=None,
492
+ )
493
+ return context
494
+
495
+ def _update_kvcache_attention(self, q, kv, inference_params):
496
+ """Write kv to inference_params, then do attention"""
497
+ if (
498
+ inference_params.seqlen_offset == 0
499
+ or flash_attn_with_kvcache is None
500
+ or not self.use_flash_attn
501
+ ):
502
+ # TODO: this only uses seqlen_offset and not lengths_per_sample.
503
+ kv = self._update_kv_cache(kv, inference_params)
504
+ return self.inner_cross_attn(q, kv)
505
+ else:
506
+ batch = q.shape[0]
507
+ kv_cache = inference_params.key_value_memory_dict[self.layer_idx][:batch]
508
+ cache_seqlens = (
509
+ inference_params.lengths_per_sample[:batch]
510
+ if inference_params.lengths_per_sample is not None
511
+ else inference_params.seqlen_offset
512
+ )
513
+ return flash_attn_with_kvcache(
514
+ q,
515
+ kv_cache[:, :, 0],
516
+ kv_cache[:, :, 1],
517
+ kv[:, :, 0],
518
+ kv[:, :, 1],
519
+ cache_seqlens=cache_seqlens,
520
+ softmax_scale=self.inner_cross_attn.softmax_scale,
521
+ causal=self.inner_cross_attn.causal,
522
+ alibi_slopes=None,
523
+ )
524
+
525
+ def forward(
526
+ self,
527
+ x,
528
+ x_kv=None,
529
+ key_padding_mask=None,
530
+ cu_seqlens=None,
531
+ max_seqlen=None,
532
+ mixer_subset=None,
533
+ inference_params=None,
534
+ **kwargs,
535
+ ):
536
+ """
537
+ Arguments:
538
+ x: (batch, seqlen, hidden_dim) (where hidden_dim = num heads * head dim) if
539
+ cu_seqlens is None and max_seqlen is None, else (total, hidden_dim) where total
540
+ is the sum of the sequence lengths in the batch.
541
+ x_kv: (batch, seqlen, hidden_dim), only applicable for cross-attention. If None, use x.
542
+ cu_seqlens: (batch_size + 1,), dtype torch.int32. The cumulative sequence lengths
543
+ of the sequences in the batch, used to index into x. Only applicable when using
544
+ FlashAttention.
545
+ max_seqlen: int. Maximum sequence length in the batch.
546
+ key_padding_mask: boolean mask, True means to keep, False means to mask out.
547
+ (batch, seqlen). Only applicable when not using FlashAttention.
548
+ mixer_subset: for cross-attention only. If not None, will take a subset of x
549
+ before applying the query projection. Useful for e.g., ViT where we only care
550
+ about the CLS token in the last layer.
551
+ inference_params: for generation. Adapted from Megatron-LM (and Apex)
552
+ https://github.com/NVIDIA/apex/blob/3ff1a10f72ec07067c4e44759442329804ac5162/apex/transformer/testing/standalone_transformer_lm.py#L470
553
+ """
554
+ if cu_seqlens is not None:
555
+ assert max_seqlen is not None
556
+ assert key_padding_mask is None
557
+ assert self.use_flash_attn
558
+ assert not self.dwconv
559
+ if key_padding_mask is not None:
560
+ assert cu_seqlens is None
561
+ assert max_seqlen is None
562
+ assert not self.use_flash_attn
563
+ if inference_params is not None:
564
+ assert key_padding_mask is None
565
+ assert cu_seqlens is None and max_seqlen is None
566
+ assert not self.dwconv
567
+
568
+ kwargs = (
569
+ {"cu_seqlens": cu_seqlens, "max_seqlen": max_seqlen, **kwargs}
570
+ if self.use_flash_attn
571
+ else {"key_padding_mask": key_padding_mask, **kwargs}
572
+ )
573
+ seqlen_offset = (
574
+ 0
575
+ if inference_params is None
576
+ else (
577
+ inference_params.lengths_per_sample
578
+ if inference_params.lengths_per_sample is not None
579
+ else inference_params.seqlen_offset
580
+ )
581
+ )
582
+ rotary_max_seqlen = (
583
+ inference_params.max_sequence_len if inference_params is not None else max_seqlen
584
+ )
585
+ batch, seqlen = x.shape[:2]
586
+ if not self.cross_attn and self.num_heads_kv == self.num_heads:
587
+ assert x_kv is None and mixer_subset is None
588
+ if not self.return_residual:
589
+ qkv = self.Wqkv(x)
590
+ else:
591
+ qkv, x = self.Wqkv(x)
592
+ if self.dwconv:
593
+ qkv = rearrange(
594
+ self.dwconv_qkv(rearrange(qkv, "b s d -> b d s"))[..., :-2], "b d s -> b s d"
595
+ ).contiguous()
596
+ qkv = rearrange(qkv, "... (three h d) -> ... three h d", three=3, d=self.head_dim)
597
+ if (
598
+ inference_params is None
599
+ or inference_params.seqlen_offset == 0
600
+ or (self.rotary_emb_dim == 0 or self.rotary_emb_dim % 16 != 0)
601
+ or not self.use_flash_attn
602
+ ):
603
+ if inference_params is None:
604
+ if not self.checkpointing:
605
+ context = self.inner_attn(qkv, **kwargs)
606
+ else:
607
+ context = torch.utils.checkpoint.checkpoint(self.inner_attn, qkv, **kwargs)
608
+ else:
609
+ context = self._update_kvcache_attention(
610
+ qkv[:, :, 0], qkv[:, :, 1:], inference_params
611
+ )
612
+ else:
613
+ context = self._apply_rotary_update_kvcache_attention(
614
+ qkv[:, :, 0], qkv[:, :, 1:], inference_params
615
+ )
616
+ else:
617
+ if self.cross_attn:
618
+ if not self.return_residual:
619
+ q = self.Wq(x if mixer_subset is None else x[:, mixer_subset])
620
+ kv = self.Wkv(x_kv if x_kv is not None else x)
621
+ else:
622
+ if x_kv is not None:
623
+ kv, x_kv = self.Wkv(x_kv)
624
+ else:
625
+ kv, x = self.Wkv(x)
626
+ q = self.Wq(x if mixer_subset is None else x[:, mixer_subset])
627
+ else:
628
+ assert self.num_heads_kv != self.num_heads
629
+ if not self.return_residual:
630
+ qkv = self.Wqkv(x)
631
+ else:
632
+ qkv, x = self.Wqkv(x)
633
+ q = qkv[..., : self.num_heads * self.head_dim]
634
+ kv = qkv[..., self.num_heads * self.head_dim :]
635
+ q = rearrange(q, "... (h d) -> ... h d", d=self.head_dim)
636
+ kv = rearrange(kv, "... (two hkv d) -> ... two hkv d", two=2, d=self.head_dim)
637
+ if self.dwconv:
638
+ q = rearrange(
639
+ self.dwconv_q(rearrange(q, "b s d -> b d s"))[..., :-2], "b d s -> b s d"
640
+ ).contiguous()
641
+ kv = rearrange(
642
+ self.dwconv_kv(rearrange(kv, "b s d -> b d s"))[..., :-2], "b d s -> b s d"
643
+ ).contiguous()
644
+ if (
645
+ inference_params is None
646
+ or inference_params.seqlen_offset == 0
647
+ or (self.rotary_emb_dim == 0 or self.rotary_emb_dim % 16 != 0)
648
+ or not self.use_flash_attn
649
+ ):
650
+ if inference_params is None:
651
+ if not self.checkpointing:
652
+ context = self.inner_cross_attn(q, kv, **kwargs)
653
+ else:
654
+ context = torch.utils.checkpoint.checkpoint(
655
+ self.inner_cross_attn, q, kv, **kwargs
656
+ )
657
+ else:
658
+ context = self._update_kvcache_attention(q, kv, inference_params)
659
+ else:
660
+ context = self._apply_rotary_update_kvcache_attention(q, kv, inference_params)
661
+ out = self.out_proj(rearrange(context, "... h d -> ... (h d)"))
662
+ return out if not self.return_residual else (out, x)
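`MHA` above projects Q, K, and V with a single `Wqkv` linear layer, then dispatches either to the FlashAttention kernels or to the pure-PyTorch `SelfAttention`/`CrossAttention` fallbacks. A hedged CPU smoke test of the non-flash self-attention path, assuming `mha.py` is importable as a local module:

```python
import torch
from mha import MHA  # assumes mha.py is on the import path

attn = MHA(embed_dim=768, num_heads=12, use_flash_attn=False, causal=False)
x = torch.randn(2, 16, 768)

# Boolean key padding mask: True = keep, False = mask out.
mask = torch.ones(2, 16, dtype=torch.bool)
mask[1, 10:] = False  # second sequence only has 10 real tokens

out = attn(x, key_padding_mask=mask)
print(out.shape)  # torch.Size([2, 16, 768])
```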
mlp.py ADDED
@@ -0,0 +1,194 @@
1
+ # This implementation was adapted from https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/mlp.py
2
+ # Commit id: c3b219665292c61a51153d0ded4473c494296382
3
+
4
+ # Copyright (c) 2023, Tri Dao.
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+ from torch.distributed import ProcessGroup
10
+
11
+
12
+ try:
13
+ from flash_attn.ops.activations import swiglu
14
+ except ImportError:
15
+ swiglu = None
16
+
17
+ try:
18
+ from flash_attn.ops.fused_dense import ColumnParallelLinear, RowParallelLinear
19
+ except ImportError:
20
+ ColumnParallelLinear, RowParallelLinear = None, None
21
+
22
+ try:
23
+ from flash_attn.ops.fused_dense import FusedMLP, ParallelFusedMLP
24
+ except ImportError:
25
+ FusedMLP, ParallelFusedMLP = None, None
26
+
27
+
28
+ class Mlp(nn.Module):
29
+ def __init__(
30
+ self,
31
+ in_features,
32
+ hidden_features=None,
33
+ out_features=None,
34
+ activation=F.gelu,
35
+ bias1=True,
36
+ bias2=True,
37
+ return_residual=False,
38
+ device=None,
39
+ dtype=None,
40
+ ):
41
+ factory_kwargs = {"device": device, "dtype": dtype}
42
+ super().__init__()
43
+ out_features = out_features if out_features is not None else in_features
44
+ hidden_features = hidden_features if hidden_features is not None else in_features * 4
45
+ self.return_residual = return_residual
46
+ self.fc1 = nn.Linear(in_features, hidden_features, bias=bias1, **factory_kwargs)
47
+ self.activation = activation
48
+ self.fc2 = nn.Linear(hidden_features, out_features, bias=bias2, **factory_kwargs)
49
+
50
+ def forward(self, x):
51
+ y = self.fc1(x)
52
+ y = self.activation(y)
53
+ y = self.fc2(y)
54
+ return y if not self.return_residual else (y, x)
55
+
56
+
57
+ class ParallelMLP(nn.Module):
58
+ def __init__(
59
+ self,
60
+ in_features,
61
+ hidden_features=None,
62
+ out_features=None,
63
+ activation=F.gelu,
64
+ process_group: ProcessGroup = None,
65
+ sequence_parallel=True,
66
+ bias1=True,
67
+ bias2=True,
68
+ device=None,
69
+ dtype=None,
70
+ ):
71
+ factory_kwargs = {"device": device, "dtype": dtype}
72
+ super().__init__()
73
+ assert ColumnParallelLinear is not None, "Need to install fused_dense"
74
+ assert RowParallelLinear is not None, "Need to install fused_dense"
75
+ out_features = out_features if out_features is not None else in_features
76
+ hidden_features = hidden_features if hidden_features is not None else in_features * 4
77
+ self.fc1 = ColumnParallelLinear(
78
+ in_features,
79
+ hidden_features,
80
+ process_group,
81
+ bias=bias1,
82
+ sequence_parallel=sequence_parallel,
83
+ **factory_kwargs,
84
+ )
85
+ self.activation = activation
86
+ self.fc2 = RowParallelLinear(
87
+ hidden_features,
88
+ out_features,
89
+ process_group,
90
+ bias=bias2,
91
+ sequence_parallel=sequence_parallel,
92
+ **factory_kwargs,
93
+ )
94
+
95
+ def forward(self, x):
96
+ y = self.fc1(x)
97
+ y = self.activation(y)
98
+ y = self.fc2(y)
99
+ return y
100
+
101
+
102
+ class GatedMlp(nn.Module):
103
+ def __init__(
104
+ self,
105
+ in_features,
106
+ hidden_features=None,
107
+ out_features=None,
108
+ activation=F.sigmoid,
109
+ bias1=True,
110
+ bias2=True,
111
+ multiple_of=128,
112
+ return_residual=False,
113
+ device=None,
114
+ dtype=None,
115
+ ):
116
+ factory_kwargs = {"device": device, "dtype": dtype}
117
+ super().__init__()
118
+ out_features = out_features if out_features is not None else in_features
119
+ hidden_features = (
120
+ hidden_features if hidden_features is not None else int(8 * in_features / 3)
121
+ )
122
+ hidden_features = (hidden_features + multiple_of - 1) // multiple_of * multiple_of
123
+ self.return_residual = return_residual
124
+ self.fc1 = nn.Linear(in_features, 2 * hidden_features, bias=bias1, **factory_kwargs)
125
+ self.activation = activation
126
+ self.fc2 = nn.Linear(hidden_features, out_features, bias=bias2, **factory_kwargs)
127
+
128
+ def forward(self, x):
129
+ y = self.fc1(x)
130
+ if self.activation == F.sigmoid: # Special case for GLU
131
+ y = F.glu(y, dim=-1)
132
+ elif self.activation == F.silu and swiglu is not None: # Special case for SwiGLU
133
+ y, gate = y.chunk(2, dim=-1)
134
+ y = swiglu(gate, y)
135
+ else:
136
+ y, gate = y.chunk(2, dim=-1)
137
+ y = y * self.activation(gate)
138
+ y = self.fc2(y)
139
+ return y if not self.return_residual else (y, x)
140
+
141
+
142
+ class ParallelGatedMlp(nn.Module):
143
+ """Parallel GatedMlp"""
144
+
145
+ def __init__(
146
+ self,
147
+ in_features,
148
+ process_group,
149
+ hidden_features=None,
150
+ out_features=None,
151
+ activation=F.sigmoid,
152
+ bias1=True,
153
+ bias2=True,
154
+ multiple_of=128,
155
+ sequence_parallel=True,
156
+ device=None,
157
+ dtype=None,
158
+ ):
159
+ factory_kwargs = {"device": device, "dtype": dtype}
160
+ super().__init__()
161
+ out_features = out_features if out_features is not None else in_features
162
+ hidden_features = (
163
+ hidden_features if hidden_features is not None else int(8 * in_features / 3)
164
+ )
165
+ hidden_features = (hidden_features + multiple_of - 1) // multiple_of * multiple_of
166
+ if ColumnParallelLinear is None or RowParallelLinear is None:
167
+ raise ImportError("fused_dense is not installed")
168
+ self.fc1 = ColumnParallelLinear(
169
+ in_features,
170
+ 2 * hidden_features,
171
+ process_group,
172
+ bias=bias1,
173
+ sequence_parallel=sequence_parallel,
174
+ **factory_kwargs,
175
+ )
176
+ self.activation = activation
177
+ self.fc2 = RowParallelLinear(
178
+ hidden_features,
179
+ out_features,
180
+ process_group,
181
+ bias=bias2,
182
+ sequence_parallel=sequence_parallel,
183
+ **factory_kwargs,
184
+ )
185
+
186
+ def forward(self, x):
187
+ y = self.fc1(x)
188
+ if self.activation == F.sigmoid: # Special case for GLU
189
+ y = F.glu(y, dim=-1)
190
+ else:
191
+ y, gate = y.chunk(2, dim=-1)
192
+ y = y * self.activation(gate)
193
+ y = self.fc2(y)
194
+ return y
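`GatedMlp` above doubles the width of `fc1` and splits its output into a value half and a gate half; the activated gate multiplies the value before `fc2` projects back down (GLU with sigmoid, SwiGLU with SiLU). A hedged sketch of that computation, assuming `mlp.py` is importable and flash_attn's fused `swiglu` is unavailable so the plain-PyTorch branch runs:

```python
import torch
import torch.nn.functional as F
from mlp import GatedMlp  # assumes mlp.py is on the import path

gated = GatedMlp(in_features=768, activation=F.silu)  # SwiGLU-style gating
x = torch.randn(2, 16, 768)
y = gated(x)
print(y.shape)  # torch.Size([2, 16, 768])

# The same computation written out by hand (non-fused code path):
value, gate = gated.fc1(x).chunk(2, dim=-1)
manual = gated.fc2(value * F.silu(gate))
print(torch.allclose(manual, y, atol=1e-6))  # True
```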
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5c7c4b46d9d342fe128e46a55adbac52fd96dd0aaafee7c26e8d8f2286bee91a
+ size 556892306
modeling_xlm_roberta.py ADDED
@@ -0,0 +1,1119 @@
1
+ # This implementation was adopted from https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/models/bert.py
2
+ # Commit id: abbc1311731867310635f9edc2a9ec18317c8c48
3
+ # Copyright (c) 2022, Tri Dao.
4
+ # This BERT implementation is based on our MLPerf 2.0 and MLPerf 2.1 BERT implementation.
5
+ # https://github.com/mlcommons/training_results_v2.0/blob/main/HazyResearch/benchmarks/bert/implementations/pytorch/modeling.py
6
+ # https://github.com/mlcommons/training_results_v2.1/blob/main/Azure-HazyResearch/benchmarks/bert/implementations/ND96amsr_A100_v4/modeling.py
7
+
8
+ # Inspired by https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py
9
+
10
+ import importlib.util
11
+ import logging
12
+ import re
13
+ from collections import OrderedDict
14
+ from collections.abc import Sequence
15
+ from functools import partial
16
+ import numpy as np
17
+
18
+ import torch
19
+ import torch.nn as nn
20
+ import torch.nn.functional as F
21
+ import torch.utils.checkpoint
22
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
23
+ from einops import rearrange
24
+ from transformers import PretrainedConfig
25
+ from transformers.modeling_utils import PreTrainedModel
26
+ from transformers.modeling_outputs import MaskedLMOutput,SequenceClassifierOutput
27
+ from transformers.models.xlm_roberta.modeling_xlm_roberta import XLMRobertaLMHead
28
+
29
+ from transformers.models.bert.modeling_bert import (
30
+ BaseModelOutputWithPoolingAndCrossAttentions,
31
+ BertForPreTrainingOutput,
32
+ )
33
+
34
+ from typing import List, Optional, Tuple, Union
35
+
36
+ from .xlm_padding import (
37
+ index_first_axis,
38
+ index_first_axis_residual,
39
+ pad_input,
40
+ unpad_input,
41
+ )
42
+ from .configuration_xlm_roberta import XLMRobertaFlashConfig
43
+ from .block import Block
44
+ from .embedding import XLMRobertaEmbeddings
45
+ from .mha import MHA
46
+ from .mlp import FusedMLP, Mlp
47
+
48
+ try:
49
+ from flash_attn.ops.fused_dense import FusedDense
50
+ except ImportError:
51
+ FusedDense = None
52
+
53
+ try:
54
+ from flash_attn.ops.triton.layer_norm import layer_norm_fn
55
+ except ImportError:
56
+ layer_norm_fn = None
57
+
58
+
59
+ try:
60
+ from flash_attn.losses.cross_entropy import CrossEntropyLoss
61
+ except ImportError:
62
+ CrossEntropyLoss = torch.nn.CrossEntropyLoss
63
+
64
+ try:
65
+ from tqdm.autonotebook import trange
66
+ except ImportError:
67
+ trange = None
68
+
69
+
70
+ logger = logging.getLogger(__name__)
71
+
72
+
73
+ def get_use_flash_attn(config: XLMRobertaFlashConfig):
74
+ if not getattr(config, "use_flash_attn", False):
75
+ return False
76
+ if not torch.cuda.is_available():
77
+ return False
78
+ if importlib.util.find_spec("flash_attn") is None:
79
+ logger.warning(
80
+ 'flash_attn is not installed. Using PyTorch native attention implementation.'
81
+ )
82
+ return False
83
+ return True
84
+
85
+
86
+ def create_mixer_cls(config, cross_attn=False, return_residual=False):
87
+ use_flash_attn = get_use_flash_attn(config)
88
+ fused_bias_fc = getattr(config, "fused_bias_fc", False)
89
+
90
+ mixer_cls = partial(
91
+ MHA,
92
+ num_heads=config.num_attention_heads,
93
+ cross_attn=cross_attn,
94
+ dropout=config.attention_probs_dropout_prob,
95
+ causal=False,
96
+ fused_bias_fc=fused_bias_fc,
97
+ use_flash_attn=use_flash_attn,
98
+ return_residual=return_residual,
99
+ )
100
+ return mixer_cls
101
+
102
+
103
+ def create_mlp_cls(config, layer_idx=None, return_residual=False):
104
+ inner_dim = config.intermediate_size
105
+ fused_mlp = getattr(config, "fused_mlp", False)
106
+ if fused_mlp:
107
+ assert config.hidden_act in ["gelu_new", "gelu_fast", "gelu_pytorch_tanh"], (
108
+ "fused_mlp only " "supports approximate gelu"
109
+ )
110
+ if not fused_mlp:
111
+ approximate = (
112
+ "tanh"
113
+ if config.hidden_act in ["gelu_new", "gelu_fast", "gelu_pytorch_tanh"]
114
+ else "none"
115
+ )
116
+ mlp_cls = partial(
117
+ Mlp,
118
+ hidden_features=inner_dim,
119
+ activation=partial(F.gelu, approximate=approximate),
120
+ return_residual=return_residual,
121
+ )
122
+ else:
123
+ if FusedMLP is None:
124
+ raise ImportError("fused_dense is not installed")
125
+ mlp_checkpoint_lvl = getattr(config, "mlp_checkpoint_lvl", 0)
126
+ # mlp_checkpoint_lvl could be a list, which contains the checkpoint_lvl for each layer
127
+ if isinstance(mlp_checkpoint_lvl, Sequence):
128
+ assert layer_idx is not None
129
+ mlp_checkpoint_lvl = mlp_checkpoint_lvl[layer_idx]
130
+ mlp_cls = partial(
131
+ FusedMLP,
132
+ hidden_features=inner_dim,
133
+ checkpoint_lvl=mlp_checkpoint_lvl,
134
+ return_residual=return_residual,
135
+ )
136
+ return mlp_cls
137
+
138
+
139
+ def create_block(config, layer_idx=None):
140
+ last_layer_subset = getattr(config, "last_layer_subset", False)
141
+ cross_attn = last_layer_subset and layer_idx == config.num_hidden_layers - 1
142
+ # TD [2022-12-19]: For cross attention (last layer), we actually want to return the
143
+ # residual x_kv, not residual x. But it's annoying to change the API (and it only affects
144
+ # one layer) so we just choose not to return residual in this case.
145
+ return_residual = not cross_attn
146
+ mixer_cls = create_mixer_cls(config, cross_attn, return_residual=return_residual)
147
+ mlp_cls = create_mlp_cls(config, layer_idx, return_residual=return_residual)
148
+ norm_cls = partial(nn.LayerNorm, eps=config.layer_norm_eps)
149
+ block = Block(
150
+ config.hidden_size,
151
+ mixer_cls,
152
+ mlp_cls,
153
+ norm_cls=norm_cls,
154
+ prenorm=False,
155
+ resid_dropout1=config.hidden_dropout_prob,
156
+ resid_dropout2=config.hidden_dropout_prob,
157
+ fused_dropout_add_ln=getattr(config, "fused_dropout_add_ln", False),
158
+ return_residual=return_residual,
159
+ )
160
+ return block
161
+
162
+
163
+ # https://github.com/huggingface/transformers/blob/7032e0203262ebb2ebf55da8d2e01f873973e835/src/transformers/models/bert/modeling_bert.py#L748
164
+ def _init_weights(module, initializer_range=0.02):
165
+ if isinstance(module, nn.Linear):
166
+ nn.init.normal_(module.weight, std=initializer_range)
167
+ if module.bias is not None:
168
+ nn.init.zeros_(module.bias)
169
+ elif isinstance(module, nn.Embedding):
170
+ nn.init.normal_(module.weight, std=initializer_range)
171
+ if module.padding_idx is not None:
172
+ nn.init.zeros_(module.weight[module.padding_idx])
173
+
174
+
175
+ class XLMRobertaEncoder(nn.Module):
176
+ def __init__(self, config: XLMRobertaFlashConfig):
177
+ super().__init__()
178
+ self.use_flash_attn = get_use_flash_attn(config)
179
+ self.layers = nn.ModuleList(
180
+ [create_block(config, layer_idx=i) for i in range(config.num_hidden_layers)]
181
+ )
182
+ self._grad_checkpointing = False
183
+
184
+ @property
185
+ def gradient_checkpointing(self):
186
+ return self._grad_checkpointing
187
+
188
+ @gradient_checkpointing.setter
189
+ def gradient_checkpointing(self, value):
190
+ self._grad_checkpointing = value
191
+
192
+ def forward(self, hidden_states, key_padding_mask=None, subset_mask=None):
193
+ """If subset_mask is not None, we only want output for the subset of the sequence.
194
+ This means that we only compute the last layer output for these tokens.
195
+ subset_mask: (batch, seqlen), dtype=torch.bool
196
+ """
197
+ if key_padding_mask is None or not self.use_flash_attn:
198
+ mixer_kwargs = (
199
+ {"key_padding_mask": key_padding_mask.bool()}
200
+ if key_padding_mask is not None
201
+ else None
202
+ )
203
+ for layer in self.layers:
204
+ if self._grad_checkpointing:
205
+ hidden_states = torch.utils.checkpoint.checkpoint(
206
+ layer,
207
+ hidden_states,
208
+ use_reentrant=False,
209
+ mixer_kwargs=mixer_kwargs,
210
+ )
211
+ else:
212
+ hidden_states = layer(hidden_states, mixer_kwargs=mixer_kwargs)
213
+ if subset_mask is not None:
214
+ hidden_states = hidden_states[subset_mask]
215
+ else:
216
+ batch, seqlen = hidden_states.shape[:2]
217
+ hidden_states, indices, cu_seqlens, max_seqlen_in_batch = unpad_input(
218
+ hidden_states, key_padding_mask
219
+ )
220
+ mixer_kwargs = {"cu_seqlens": cu_seqlens, "max_seqlen": max_seqlen_in_batch}
221
+ if subset_mask is None:
222
+ for layer in self.layers:
223
+ if self._grad_checkpointing:
224
+ hidden_states = torch.utils.checkpoint.checkpoint(
225
+ layer,
226
+ hidden_states,
227
+ use_reentrant=False,
228
+ mixer_kwargs=mixer_kwargs,
229
+ )
230
+ else:
231
+ hidden_states = layer(hidden_states, mixer_kwargs=mixer_kwargs)
232
+ hidden_states = pad_input(hidden_states, indices, batch, seqlen)
233
+ else:
234
+ for layer in self.layers[:-1]:
235
+ if self._grad_checkpointing:
236
+ hidden_states = torch.utils.checkpoint.checkpoint(
237
+ layer,
238
+ hidden_states,
239
+ use_reentrant=False,
240
+ mixer_kwargs=mixer_kwargs,
241
+ )
242
+ else:
243
+ hidden_states = layer(hidden_states, mixer_kwargs=mixer_kwargs)
244
+ if key_padding_mask is not None:
245
+ subset_idx = torch.nonzero(
246
+ subset_mask[key_padding_mask], as_tuple=False
247
+ ).flatten()
248
+ subset_seqlens = (subset_mask & key_padding_mask).sum(
249
+ dim=-1, dtype=torch.int32
250
+ )
251
+ subset_cu_seqlens = F.pad(
252
+ torch.cumsum(subset_seqlens, dim=0, dtype=torch.torch.int32),
253
+ (1, 0),
254
+ )
255
+ else:
256
+ subset_idx = torch.nonzero(subset_mask, as_tuple=False).flatten()
257
+ subset_seqlens = subset_mask.sum(dim=-1, dtype=torch.int32)
258
+ subset_cu_seqlens = F.pad(
259
+ torch.cumsum(subset_seqlens, dim=0, dtype=torch.torch.int32),
260
+ (1, 0),
261
+ )
262
+ hidden_states_subset, hidden_states = index_first_axis_residual(
263
+ hidden_states, subset_idx
264
+ )
265
+ # It's ok to set max_seqlen_q to be much larger
266
+ mixer_kwargs = {
267
+ "x_kv": hidden_states,
268
+ "cu_seqlens": subset_cu_seqlens,
269
+ "max_seqlen": max_seqlen_in_batch,
270
+ "cu_seqlens_k": cu_seqlens,
271
+ "max_seqlen_k": max_seqlen_in_batch,
272
+ }
273
+ if self._grad_checkpointing:
274
+ hidden_states = torch.utils.checkpoint.checkpoint(
275
+ self.layers[-1],
276
+ hidden_states_subset,
277
+ use_reentrant=False,
278
+ mixer_kwargs=mixer_kwargs,
279
+ )
280
+ else:
281
+ hidden_states = self.layers[-1](
282
+ hidden_states_subset, mixer_kwargs=mixer_kwargs
283
+ )
284
+ return hidden_states
285
+
286
+
287
+ class XLMRobertaPooler(nn.Module):
288
+ def __init__(self, config):
289
+ super().__init__()
290
+ fused_bias_fc = getattr(config, "fused_bias_fc", False)
291
+ if fused_bias_fc and FusedDense is None:
292
+ raise ImportError("fused_dense is not installed")
293
+ linear_cls = nn.Linear if not fused_bias_fc else FusedDense
294
+ self.dense = linear_cls(config.hidden_size, config.hidden_size)
295
+ self.activation = nn.Tanh()
296
+
297
+ def forward(self, hidden_states, pool=True):
298
+ # We "pool" the model by simply taking the hidden state corresponding
299
+ # to the first token.
300
+ first_token_tensor = hidden_states[:, 0] if pool else hidden_states
301
+ pooled_output = self.dense(first_token_tensor)
302
+ pooled_output = self.activation(pooled_output)
303
+ return pooled_output
304
+
305
+
306
+ class XLMRobertaPredictionHeadTransform(nn.Module):
307
+ def __init__(self, config):
308
+ super().__init__()
309
+ fused_bias_fc = getattr(config, "fused_bias_fc", False)
310
+ if fused_bias_fc and FusedDense is None:
311
+ raise ImportError("fused_dense is not installed")
312
+ self.fused_dropout_add_ln = getattr(config, "fused_dropout_add_ln", False)
313
+ if self.fused_dropout_add_ln and layer_norm_fn is None:
314
+ raise ImportError("Triton is not installed")
315
+ linear_cls = nn.Linear if not fused_bias_fc else FusedDense
316
+ self.dense = linear_cls(config.hidden_size, config.hidden_size)
317
+ approximate = (
318
+ "tanh"
319
+ if config.hidden_act in ["gelu_new", "gelu_fast", "gelu_pytorch_tanh"]
320
+ else "none"
321
+ )
322
+ self.transform_act_fn = nn.GELU(approximate=approximate)
323
+ self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
324
+
325
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
326
+ hidden_states = self.dense(hidden_states)
327
+ hidden_states = self.transform_act_fn(hidden_states)
328
+ if not self.fused_dropout_add_ln:
329
+ hidden_states = self.layer_norm(hidden_states)
330
+ else:
331
+ hidden_states = layer_norm_fn(
332
+ hidden_states,
333
+ self.layer_norm.weight,
334
+ self.layer_norm.bias,
335
+ eps=self.layer_norm.eps,
336
+ )
337
+ return hidden_states
338
+
339
+
340
+ class XLMRobertaLMPredictionHead(nn.Module):
341
+ def __init__(self, config):
342
+ super().__init__()
343
+ fused_bias_fc = getattr(config, "fused_bias_fc", False)
344
+ if fused_bias_fc and FusedDense is None:
345
+ raise ImportError("fused_dense is not installed")
346
+ linear_cls = nn.Linear if not fused_bias_fc else FusedDense
347
+
348
+ self.transform = XLMRobertaPredictionHeadTransform(config)
349
+
350
+ # The output weights are the same as the input embeddings, but there is
351
+ # an output-only bias for each token.
352
+ self.decoder = linear_cls(config.hidden_size, config.vocab_size, bias=True)
353
+
354
+ def forward(self, hidden_states):
355
+ hidden_states = self.transform(hidden_states)
356
+ hidden_states = self.decoder(hidden_states)
357
+ return hidden_states
358
+
359
+
360
+ class XLMRobertaPreTrainingHeads(nn.Module):
361
+ def __init__(self, config):
362
+ super().__init__()
363
+ self.predictions = XLMRobertaLMPredictionHead(config)
364
+ self.seq_relationship = nn.Linear(config.hidden_size, 2)
365
+
366
+ def forward(self, sequence_output, pooled_output):
367
+ prediction_scores = self.predictions(sequence_output)
368
+ seq_relationship_score = self.seq_relationship(pooled_output)
369
+ return prediction_scores, seq_relationship_score
370
+
371
+
372
+ class XLMRobertaPreTrainedModel(PreTrainedModel):
373
+ """An abstract class to handle weights initialization and
374
+ a simple interface for downloading and loading pretrained models.
375
+ """
376
+
377
+ config_class = XLMRobertaFlashConfig
378
+ base_model_prefix = "roberta"
379
+ supports_gradient_checkpointing = True
380
+
381
+ def _set_gradient_checkpointing(self, module, value=False):
382
+ if isinstance(module, XLMRobertaEncoder):
383
+ module.gradient_checkpointing = value
384
+
385
+ @classmethod
386
+ def from_pretrained(
387
+ cls,
388
+ *args,
389
+ **kwargs,
390
+ ):
391
+ if not 'torch_dtype' in kwargs:
392
+ kwargs['torch_dtype'] = 'auto'
393
+ return super().from_pretrained(*args, **kwargs)
394
+
395
+
396
+
397
+ class XLMRobertaModel(XLMRobertaPreTrainedModel):
398
+ def __init__(self, config: XLMRobertaFlashConfig, add_pooling_layer=True):
399
+ super().__init__(config)
400
+ self.pad_vocab_size_multiple = getattr(config, "pad_vocab_size_multiple", 1)
401
+ if config.vocab_size % self.pad_vocab_size_multiple != 0:
402
+ config.vocab_size += self.pad_vocab_size_multiple - (
403
+ config.vocab_size % self.pad_vocab_size_multiple
404
+ )
405
+ self.fused_dropout_add_ln = getattr(config, "fused_dropout_add_ln", False)
406
+ if self.fused_dropout_add_ln and layer_norm_fn is None:
407
+ raise ImportError("Triton is not installed")
408
+ assert config.hidden_act in [
409
+ "gelu",
410
+ "gelu_new",
411
+ "gelu_fast",
412
+ "gelu_pytorch_tanh",
413
+ ]
414
+
415
+ self.embeddings = XLMRobertaEmbeddings(
416
+ config.hidden_size,
417
+ config.vocab_size,
418
+ config.max_position_embeddings if config.position_embedding_type == 'absolute' else -1,
419
+ config.type_vocab_size,
420
+ padding_idx=config.pad_token_id,
421
+ )
422
+ self.emb_drop = nn.Dropout(config.hidden_dropout_prob)
423
+ self.emb_ln = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
424
+ self.encoder = XLMRobertaEncoder(config)
425
+ self.pooler = XLMRobertaPooler(config) if add_pooling_layer else None
426
+
427
+ self.apply(partial(_init_weights, initializer_range=config.initializer_range))
428
+
429
+
430
+ @torch.inference_mode()
431
+ def encode(
432
+ self: 'XLMRobertaModel',
433
+ sentences: Union[str, List[str]],
434
+ batch_size: int = 32,
435
+ show_progress_bar: Optional[bool] = None,
436
+ output_value: str = 'sentence_embedding',
437
+ convert_to_numpy: bool = True,
438
+ convert_to_tensor: bool = False,
439
+ device: Optional[torch.device] = None,
440
+ normalize_embeddings: bool = False,
441
+ truncate_dim: Optional[int] = None,
442
+ **tokenizer_kwargs,
443
+ ) -> Union[List[torch.Tensor], np.ndarray, torch.Tensor]:
444
+ """
445
+ Computes sentence embeddings
446
+ Args:
447
+ sentences(`str` or `List[str]`):
448
+ Sentence or sentences to be encoded
449
+ batch_size(`int`, *optional*, defaults to 32):
450
+ Batch size for the computation
451
+ show_progress_bar(`bool`, *optional*, defaults to None):
452
+ Show a progress bar when encoding sentences.
453
+ If set to None, progress bar is only shown when
454
+ `logger.level == logging.INFO` or `logger.level == logging.DEBUG`.
455
+ output_value(`str`, *optional*, defaults to 'sentence_embedding'):
456
+ Default sentence_embedding, to get sentence embeddings.
457
+ Can be set to token_embeddings to get wordpiece token embeddings.
458
+ Set to None, to get all output values
459
+ convert_to_numpy(`bool`, *optional*, defaults to True):
460
+ If true, the output is a list of numpy vectors.
461
+ Else, it is a list of pytorch tensors.
462
+ convert_to_tensor(`bool`, *optional*, defaults to False):
463
+ If true, you get one large tensor as return.
464
+ Overwrites any setting from convert_to_numpy
465
+ device(`torch.device`, *optional*, defaults to None):
466
+ Which torch.device to use for the computation
467
+ normalize_embeddings(`bool`, *optional*, defaults to False):
468
+ If set to true, returned vectors will have length 1. In that case, the
469
+ faster dot-product (util.dot_score) instead of cosine similarity can
470
+ be used.
471
+ truncate_dim(`int`, *optional*, defaults to None):
472
+ The dimension to truncate sentence embeddings to. `None` does no truncation.
473
+ tokenizer_kwargs(`Dict[str, Any]`, *optional*, defaults to {}):
474
+ Keyword arguments for the tokenizer
475
+ Returns:
476
+ By default, a list of tensors is returned.
477
+ If convert_to_tensor, a stacked tensor is returned.
478
+ If convert_to_numpy, a numpy matrix is returned.
479
+ """
480
+ from transformers import AutoTokenizer
481
+
482
+ self.tokenizer = AutoTokenizer.from_pretrained(
483
+ self.name_or_path, trust_remote_code=True
484
+ )
485
+
486
+ is_training = self.training
487
+ self.eval()
488
+
489
+ if show_progress_bar is None:
490
+ show_progress_bar = (
491
+ logger.getEffectiveLevel() == logging.INFO
492
+ or logger.getEffectiveLevel() == logging.DEBUG
493
+ )
494
+
495
+ if convert_to_tensor:
496
+ convert_to_numpy = False
497
+
498
+ if output_value != 'sentence_embedding':
499
+ convert_to_tensor = False
500
+ convert_to_numpy = False
501
+
502
+ input_was_string = False
503
+ if isinstance(sentences, str) or not hasattr(sentences, '__len__'):
504
+ sentences = [sentences]
505
+ input_was_string = True
506
+
507
+ if device is not None:
508
+ self.to(device)
509
+
510
+ permutation = np.argsort([-len(i) for i in sentences])
511
+ inverse_permutation = np.argsort(permutation)
512
+ sentences = [sentences[idx] for idx in permutation]
513
+
514
+ tokenizer_kwargs['padding'] = tokenizer_kwargs.get('padding', True)
515
+ tokenizer_kwargs['max_length'] = tokenizer_kwargs.get(
516
+ 'max_length', self.tokenizer.init_kwargs.get('model_max_length', 8192)
517
+ )
518
+ tokenizer_kwargs['truncation'] = tokenizer_kwargs.get('truncation', True)
519
+
520
+ all_embeddings = []
521
+
522
+ if trange is not None:
523
+ range_iter = trange(
524
+ 0,
525
+ len(sentences),
526
+ batch_size,
527
+ desc="Encoding",
528
+ disable=not show_progress_bar,
529
+ )
530
+ else:
531
+ range_iter = range(0, len(sentences), batch_size)
532
+
533
+ for i in range_iter:
534
+ encoded_input = self.tokenizer(
535
+ sentences[i : i + batch_size],
536
+ return_tensors='pt',
537
+ **tokenizer_kwargs,
538
+ ).to(self.device)
539
+ token_embs = self.forward(**encoded_input)[0]
540
+
541
+ # Accumulate in fp32 to avoid overflow
542
+ token_embs = token_embs.float()
543
+
544
+ if output_value == 'token_embeddings':
545
+ raise NotImplementedError
546
+ elif output_value is None:
547
+ raise NotImplementedError
548
+ else:
549
+ if self.config.emb_pooler == 'cls':
550
+ embeddings = self.cls_pooling(
551
+ token_embs, encoded_input['attention_mask']
552
+ )
553
+ else:
554
+ embeddings = self.mean_pooling(
555
+ token_embs, encoded_input['attention_mask']
556
+ )
557
+
558
+ if normalize_embeddings:
559
+ embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
560
+
561
+ if convert_to_numpy:
562
+ embeddings = embeddings.cpu()
563
+ all_embeddings.extend(embeddings)
564
+
565
+ all_embeddings = [all_embeddings[idx] for idx in inverse_permutation]
566
+
567
+ truncate_dim = truncate_dim or self.config.truncate_dim
568
+ if truncate_dim:
569
+ all_embeddings = self.truncate_embeddings(all_embeddings, truncate_dim)
570
+
571
+ if convert_to_tensor:
572
+ all_embeddings = torch.stack(all_embeddings)
573
+ elif convert_to_numpy:
574
+ all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings])
575
+
576
+ if input_was_string:
577
+ all_embeddings = all_embeddings[0]
578
+
579
+ self.train(is_training)
580
+ return all_embeddings
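
A minimal usage sketch for the `encode` method defined above, assuming a checkpoint whose `auto_map` resolves to this `XLMRobertaModel` is loaded with `trust_remote_code=True`; the repository id below is a placeholder, not a real checkpoint name.

```python
# Minimal sketch, assuming a checkpoint that ships this custom XLMRobertaModel.
from transformers import AutoModel

repo = "your-org/your-xlm-roberta-flash-model"  # hypothetical repository id
model = AutoModel.from_pretrained(repo, trust_remote_code=True)

sentences = ["How is the weather today?", "What is the current weather like today?"]
# Returns a numpy array by default; pass convert_to_tensor=True for a stacked tensor.
embeddings = model.encode(sentences, batch_size=2, normalize_embeddings=True)
print(embeddings.shape)  # (2, hidden_size)
```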
581
+
582
+
583
+ def truncate_embeddings(self, embeddings, truncate_dim):
584
+ if not self.config.matryoshka_dimensions:
585
+ logger.warning(
586
+ 'Matryoshka embeddings are not supported, so dimension truncation will not be performed.'
587
+ )
588
+ return embeddings
589
+ elif truncate_dim in self.config.matryoshka_dimensions:
590
+ return [tensor[:truncate_dim] for tensor in embeddings]
591
+ else:
592
+ raise ValueError(f'The provided `truncate_dim` value of {truncate_dim} is not supported. '
593
+ f'Supported dimensions are {self.config.matryoshka_dimensions}.')
594
+
595
+ def mean_pooling(
596
+ self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor
597
+ ):
598
+ input_mask_expanded = (
599
+ attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
600
+ )
601
+ return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
602
+ input_mask_expanded.sum(1), min=1e-9
603
+ )
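
For clarity, the masked mean above ignores padded positions; a standalone toy example (no model required):

```python
import torch

token_embeddings = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]]])  # (batch=1, seq=3, dim=2)
attention_mask = torch.tensor([[1, 1, 0]])  # last position is padding

mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
mean = torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)
print(mean)  # tensor([[2., 2.]]) -- the padded [9., 9.] vector is excluded from the average
```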
604
+
605
+
606
+ def cls_pooling(
607
+ self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor
608
+ ):
609
+ return token_embeddings[:,0]
610
+
611
+
612
+ def forward(
613
+ self,
614
+ input_ids,
615
+ position_ids=None,
616
+ token_type_ids=None,
617
+ attention_mask=None,
618
+ masked_tokens_mask=None,
619
+ return_dict=None,
620
+ **kwargs,
621
+ ):
622
+ """If masked_tokens_mask is not None (i.e. last_layer_subset == True in XLMForPreTraining),
623
+ we only want the output for the masked tokens. This means that we only compute the last
624
+ layer output for these tokens.
625
+ masked_tokens_mask: (batch, seqlen), dtype=torch.bool
626
+ """
627
+
628
+ if kwargs:
629
+ for key, value in kwargs.items():
630
+ if value is not None:
631
+ logger.warning(
632
+ 'Flash attention implementation does not support kwargs: %s',
633
+ key,
634
+ )
635
+
636
+ return_dict = (
637
+ return_dict if return_dict is not None else self.config.use_return_dict
638
+ )
639
+
640
+ hidden_states = self.embeddings(
641
+ input_ids, position_ids=position_ids, token_type_ids=token_type_ids
642
+ )
643
+ # TD [2022-12:18]: Don't need to force residual in fp32
644
+ # BERT puts embedding LayerNorm before embedding dropout.
645
+ if not self.fused_dropout_add_ln:
646
+ hidden_states = self.emb_ln(hidden_states)
647
+ else:
648
+ hidden_states = layer_norm_fn(
649
+ hidden_states, self.emb_ln.weight, self.emb_ln.bias, eps=self.emb_ln.eps
650
+ )
651
+ hidden_states = self.emb_drop(hidden_states)
652
+
653
+ if masked_tokens_mask is not None:
654
+ batch_size, seqlen = input_ids.shape[:2]
655
+ # We also need the first column for the CLS token
656
+ first_col_mask = torch.zeros(
657
+ batch_size, seqlen, dtype=torch.bool, device=input_ids.device
658
+ )
659
+ first_col_mask[:, 0] = True
660
+ subset_mask = masked_tokens_mask | first_col_mask
661
+ else:
662
+ subset_mask = None
663
+
664
+ sequence_output = self.encoder(
665
+ hidden_states, key_padding_mask=attention_mask, subset_mask=subset_mask
666
+ )
667
+
668
+ if masked_tokens_mask is None:
669
+ pooled_output = (
670
+ self.pooler(sequence_output) if self.pooler is not None else None
671
+ )
672
+ else:
673
+ # TD [2022-03-01]: the indexing here is very tricky.
674
+ if attention_mask is not None:
675
+ subset_idx = subset_mask[attention_mask]
676
+ pool_input = sequence_output[first_col_mask[attention_mask][subset_idx]]
677
+ sequence_output = sequence_output[
678
+ masked_tokens_mask[attention_mask][subset_idx]
679
+ ]
680
+ else:
681
+ pool_input = sequence_output[first_col_mask[subset_mask]]
682
+ sequence_output = sequence_output[masked_tokens_mask[subset_mask]]
683
+ pooled_output = (
684
+ self.pooler(pool_input, pool=False) if self.pooler is not None else None
685
+ )
686
+
687
+ if not return_dict:
688
+ return sequence_output, pooled_output
689
+
690
+ return BaseModelOutputWithPoolingAndCrossAttentions(
691
+ last_hidden_state=sequence_output,
692
+ pooler_output=pooled_output,
693
+ )
694
+
695
+
696
+ class XLMRobertaForMaskedLM(XLMRobertaPreTrainedModel):
697
+ _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"]
698
+
699
+ def __init__(self, config):
700
+ super().__init__(config)
701
+
702
+ if config.is_decoder:
703
+ logger.warning(
704
+ "If you want to use `XLMRobertaForMaskedLM` make sure `config.is_decoder=False` for "
705
+ "bi-directional self-attention."
706
+ )
707
+
708
+ self.roberta = XLMRobertaModel(config, add_pooling_layer=False)
709
+ self.lm_head = XLMRobertaLMHead(config)
710
+
711
+ # Initialize weights and apply final processing
712
+ self.post_init()
713
+
714
+ def get_input_embeddings(self):
715
+ return self.roberta.embeddings.word_embeddings
716
+
717
+ def get_output_embeddings(self):
718
+ return self.lm_head.decoder
719
+
720
+ def set_output_embeddings(self, new_embeddings):
721
+ self.lm_head.decoder = new_embeddings
722
+
723
+ def forward(
724
+ self,
725
+ input_ids: Optional[torch.LongTensor] = None,
726
+ attention_mask: Optional[torch.FloatTensor] = None,
727
+ token_type_ids: Optional[torch.LongTensor] = None,
728
+ position_ids: Optional[torch.LongTensor] = None,
729
+ head_mask: Optional[torch.FloatTensor] = None,
730
+ inputs_embeds: Optional[torch.FloatTensor] = None,
731
+ encoder_hidden_states: Optional[torch.FloatTensor] = None,
732
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
733
+ labels: Optional[torch.LongTensor] = None,
734
+ output_attentions: Optional[bool] = None,
735
+ output_hidden_states: Optional[bool] = None,
736
+ return_dict: Optional[bool] = None,
737
+ ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
738
+ r"""
739
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
740
+ Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
741
+ config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
742
+ loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
743
+ kwargs (`Dict[str, any]`, optional, defaults to *{}*):
744
+ Used to hide legacy arguments that have been deprecated.
745
+ """
746
+ return_dict = (
747
+ return_dict if return_dict is not None else self.config.use_return_dict
748
+ )
749
+
750
+ outputs = self.roberta(
751
+ input_ids,
752
+ attention_mask=attention_mask,
753
+ token_type_ids=token_type_ids,
754
+ position_ids=position_ids,
755
+ head_mask=head_mask,
756
+ inputs_embeds=inputs_embeds,
757
+ encoder_hidden_states=encoder_hidden_states,
758
+ encoder_attention_mask=encoder_attention_mask,
759
+ output_attentions=output_attentions,
760
+ output_hidden_states=output_hidden_states,
761
+ return_dict=return_dict,
762
+ )
763
+ sequence_output = outputs[0]
764
+ prediction_scores = self.lm_head(sequence_output)
765
+
766
+ masked_lm_loss = None
767
+ if labels is not None:
768
+ # move labels to correct device to enable model parallelism
769
+ labels = labels.to(prediction_scores.device)
770
+ loss_fct = CrossEntropyLoss()
771
+ masked_lm_loss = loss_fct(
772
+ prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)
773
+ )
774
+
775
+ if not return_dict:
776
+ output = (prediction_scores,) + outputs[2:]
777
+ return (
778
+ ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
779
+ )
780
+
781
+ return MaskedLMOutput(
782
+ loss=masked_lm_loss,
783
+ logits=prediction_scores,
784
+ hidden_states=outputs.hidden_states,
785
+ attentions=outputs.attentions,
786
+ )
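
A hedged usage sketch for the masked-LM class above; the repository id is a placeholder for a checkpoint whose `auto_map` points at `XLMRobertaForMaskedLM`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo = "your-org/your-xlm-roberta-flash-mlm"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)

inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```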
787
+
788
+
789
+ # Copied from transformers.models.roberta.modeling_roberta.RobertaClassificationHead with Roberta->XLMRoberta
790
+ class XLMRobertaClassificationHead(nn.Module):
791
+ """Head for sentence-level classification tasks."""
792
+
793
+ def __init__(self, config):
794
+ super().__init__()
795
+ fused_bias_fc = getattr(config, "fused_bias_fc", False)
796
+ if fused_bias_fc and FusedDense is None:
797
+ raise ImportError("fused_dense is not installed")
798
+ linear_cls = nn.Linear if not fused_bias_fc else FusedDense
799
+ self.dense = linear_cls(config.hidden_size, config.hidden_size)
800
+ classifier_dropout = (
801
+ config.classifier_dropout
802
+ if config.classifier_dropout is not None
803
+ else config.hidden_dropout_prob
804
+ )
805
+ self.dropout = nn.Dropout(classifier_dropout)
806
+ self.out_proj = linear_cls(config.hidden_size, config.num_labels)
807
+
808
+ def forward(self, features, **kwargs):
809
+ x = features[:, 0, :] # take <s> token (equiv. to [CLS])
810
+ x = self.dropout(x)
811
+ x = self.dense(x)
812
+ x = torch.tanh(x)
813
+ x = self.dropout(x)
814
+ x = self.out_proj(x)
815
+ return x
816
+
817
+
818
+ # Copied from transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification with Roberta->XLMRoberta, ROBERTA->XLM_ROBERTA
819
+ class XLMRobertaForSequenceClassification(XLMRobertaPreTrainedModel):
820
+ def __init__(self, config):
821
+ super().__init__(config)
822
+ self.num_labels = config.num_labels
823
+ self.config = config
824
+
825
+ self.roberta = XLMRobertaModel(config, add_pooling_layer=False)
826
+ self.classifier = XLMRobertaClassificationHead(config)
827
+
828
+ # Initialize weights and apply final processing
829
+ self.post_init()
830
+
831
+ def forward(
832
+ self,
833
+ input_ids: Optional[torch.LongTensor] = None,
834
+ attention_mask: Optional[torch.FloatTensor] = None,
835
+ token_type_ids: Optional[torch.LongTensor] = None,
836
+ position_ids: Optional[torch.LongTensor] = None,
837
+ head_mask: Optional[torch.FloatTensor] = None,
838
+ inputs_embeds: Optional[torch.FloatTensor] = None,
839
+ labels: Optional[torch.LongTensor] = None,
840
+ output_attentions: Optional[bool] = None,
841
+ output_hidden_states: Optional[bool] = None,
842
+ return_dict: Optional[bool] = None,
843
+ ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
844
+ r"""
845
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
846
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
847
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
848
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
849
+ """
850
+ return_dict = (
851
+ return_dict if return_dict is not None else self.config.use_return_dict
852
+ )
853
+
854
+ outputs = self.roberta(
855
+ input_ids,
856
+ attention_mask=attention_mask,
857
+ token_type_ids=token_type_ids,
858
+ position_ids=position_ids,
859
+ head_mask=head_mask,
860
+ inputs_embeds=inputs_embeds,
861
+ output_attentions=output_attentions,
862
+ output_hidden_states=output_hidden_states,
863
+ return_dict=return_dict,
864
+ )
865
+ sequence_output = outputs[0]
866
+ logits = self.classifier(sequence_output)
867
+
868
+ loss = None
869
+ if labels is not None:
870
+ # move labels to correct device to enable model parallelism
871
+ labels = labels.to(logits.device)
872
+ if self.config.problem_type is None:
873
+ if self.num_labels == 1:
874
+ self.config.problem_type = "regression"
875
+ elif self.num_labels > 1 and (
876
+ labels.dtype == torch.long or labels.dtype == torch.int
877
+ ):
878
+ self.config.problem_type = "single_label_classification"
879
+ else:
880
+ self.config.problem_type = "multi_label_classification"
881
+
882
+ if self.config.problem_type == "regression":
883
+ loss_fct = MSELoss()
884
+ if self.num_labels == 1:
885
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
886
+ else:
887
+ loss = loss_fct(logits, labels)
888
+ elif self.config.problem_type == "single_label_classification":
889
+ loss_fct = CrossEntropyLoss()
890
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
891
+ elif self.config.problem_type == "multi_label_classification":
892
+ loss_fct = BCEWithLogitsLoss()
893
+ loss = loss_fct(logits, labels)
894
+
895
+ if not return_dict:
896
+ output = (logits,) + outputs[2:]
897
+ return ((loss,) + output) if loss is not None else output
898
+
899
+ return SequenceClassifierOutput(
900
+ loss=loss,
901
+ logits=logits,
902
+ hidden_states=outputs.hidden_states,
903
+ attentions=outputs.attentions,
904
+ )
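
The loss above is selected from `config.problem_type`, which is inferred from `num_labels` and the label dtype on the first call with labels; a standalone sketch of that decision rule (the helper name is illustrative):

```python
import torch

def infer_problem_type(num_labels: int, labels: torch.Tensor) -> str:
    # Mirrors the branching above: one label -> regression; integer targets ->
    # single-label classification; anything else -> multi-label classification.
    if num_labels == 1:
        return "regression"
    if labels.dtype in (torch.long, torch.int):
        return "single_label_classification"
    return "multi_label_classification"

print(infer_problem_type(1, torch.tensor([0.7])))              # regression (MSELoss)
print(infer_problem_type(3, torch.tensor([2])))                # single_label_classification (CrossEntropyLoss)
print(infer_problem_type(3, torch.tensor([[1.0, 0.0, 1.0]])))  # multi_label_classification (BCEWithLogitsLoss)
```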
905
+
906
+
907
+ @torch.inference_mode()
908
+ def compute_score(
909
+ self,
910
+ sentence_pairs: Union[List[Tuple[str, str]], Tuple[str, str]],
911
+ batch_size: int = 32,
912
+ max_length: Optional[int] = None,
913
+ ) -> List[float]:
914
+
915
+ if not hasattr(self, "_tokenizer"):
916
+ from transformers import AutoTokenizer
917
+
918
+ self._tokenizer = AutoTokenizer.from_pretrained(
919
+ self.name_or_path, trust_remote_code=True
920
+ )
921
+
922
+ assert isinstance(sentence_pairs, list)
923
+ if isinstance(sentence_pairs[0], str):
924
+ sentence_pairs = [sentence_pairs]
925
+
926
+ all_scores = []
927
+ for start_index in range(
928
+ 0, len(sentence_pairs), batch_size
929
+ ):
930
+ sentences_batch = sentence_pairs[
931
+ start_index : start_index + batch_size
932
+ ]
933
+ inputs = self._tokenizer(
934
+ sentences_batch,
935
+ padding=True,
936
+ truncation=True,
937
+ return_tensors='pt',
938
+ max_length=max_length,
939
+ ).to(self.device)
940
+ scores = (
941
+ self.forward(**inputs, return_dict=True)
942
+ .logits.view(
943
+ -1,
944
+ )
945
+ .float()
946
+ )
947
+ scores = torch.sigmoid(scores)
948
+ all_scores.extend(scores.cpu().numpy().tolist())
949
+
950
+ if len(all_scores) == 1:
951
+ return all_scores[0]
952
+ return all_scores
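
A hedged sketch of scoring query/document pairs with `compute_score`; it assumes the checkpoint is loaded through `AutoModelForSequenceClassification` with `trust_remote_code=True`, as with the base checkpoint this code ships in.

```python
from transformers import AutoModelForSequenceClassification

repo = "jinaai/jina-reranker-v2-base-multilingual"  # base checkpoint carrying this custom code
model = AutoModelForSequenceClassification.from_pretrained(repo, trust_remote_code=True)
model.eval()

pairs = [
    ("How long do sea turtles live?", "Sea turtles can live for 50 to 100 years."),
    ("How long do sea turtles live?", "The Eiffel Tower is located in Paris."),
]
scores = model.compute_score(pairs, batch_size=2)
print(scores)  # one sigmoid-scaled relevance score per pair; higher means more relevant
```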
953
+
954
+ def predict(
955
+ self,
956
+ sentence_pairs: Union[List[Tuple[str, str]], Tuple[str, str]],
957
+ batch_size: int = 32,
958
+ max_length: Optional[int] = None,
959
+ ) -> List[float]:
960
+ # used for beir evaluation
961
+ return self.compute_score(sentence_pairs, batch_size=batch_size, max_length=max_length)
962
+
963
+ def rerank(
964
+ self,
965
+ query: str,
966
+ documents: List[str],
967
+ batch_size: int = 32,
968
+ max_length: int = 1024,
969
+ max_query_length: int = 512,
970
+ overlap_tokens: int = 80,
971
+ top_n: Optional[int] = None,
972
+ **kwargs,
973
+ ):
974
+ assert max_length >= max_query_length * 2, (
975
+ f'max_length ({max_length}) must be greater than or equal to '
976
+ f'max_query_length ({max_query_length}) * 2'
977
+ )
978
+
979
+ if not hasattr(self, "_tokenizer"):
980
+ from transformers import AutoTokenizer
981
+
982
+ self._tokenizer = AutoTokenizer.from_pretrained(
983
+ self.name_or_path, trust_remote_code=True
984
+ )
985
+
986
+ # preproc of tokenization
987
+ sentence_pairs, sentence_pairs_pids = reranker_tokenize_preproc(
988
+ query,
989
+ documents,
990
+ tokenizer=self._tokenizer,
991
+ max_length=max_length,
992
+ max_query_length=max_query_length,
993
+ overlap_tokens=overlap_tokens,
994
+ )
995
+
996
+ tot_scores = []
997
+ with torch.no_grad():
998
+ for k in range(0, len(sentence_pairs), batch_size):
999
+ batch = self._tokenizer.pad(
1000
+ sentence_pairs[k : k + batch_size],
1001
+ padding=True,
1002
+ max_length=max_length,
1003
+ pad_to_multiple_of=None,
1004
+ return_tensors="pt",
1005
+ )
1006
+ batch_on_device = {k: v.to(self.device) for k, v in batch.items()}
1007
+ scores = (
1008
+ self.forward(**batch_on_device, return_dict=True)
1009
+ .logits.view(
1010
+ -1,
1011
+ )
1012
+ .float()
1013
+ )
1014
+ scores = torch.sigmoid(scores)
1015
+ tot_scores.extend(scores.cpu().numpy().tolist())
1016
+
1017
+ # ranking
1018
+ merge_scores = [0 for _ in range(len(documents))]
1019
+ for pid, score in zip(sentence_pairs_pids, tot_scores):
1020
+ merge_scores[pid] = max(merge_scores[pid], score)
1021
+
1022
+ merge_scores_argsort = np.argsort(merge_scores)[::-1]
1023
+ sorted_documents = []
1024
+ sorted_scores = []
1025
+ for mid in merge_scores_argsort:
1026
+ sorted_scores.append(merge_scores[mid])
1027
+ sorted_documents.append(documents[mid])
1028
+
1029
+ top_n = min(top_n or len(sorted_documents), len(sorted_documents))
1030
+
1031
+ return [
1032
+ {
1033
+ 'document': sorted_documents[i],
1034
+ 'relevance_score': sorted_scores[i],
1035
+ 'index': merge_scores_argsort[i],
1036
+ }
1037
+ for i in range(top_n)
1038
+ ]
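
`rerank` chunks long documents with overlap, scores every query/chunk pair, and keeps the best chunk score per document. A hedged usage sketch under the same loading assumptions as above:

```python
from transformers import AutoModelForSequenceClassification

repo = "jinaai/jina-reranker-v2-base-multilingual"  # base checkpoint carrying this custom code
model = AutoModelForSequenceClassification.from_pretrained(repo, trust_remote_code=True)
model.eval()

results = model.rerank(
    query="What is the capital of France?",
    documents=[
        "Paris is the capital and largest city of France.",
        "Berlin is the capital of Germany.",
    ],
    top_n=2,
)
for hit in results:
    print(hit["index"], round(hit["relevance_score"], 3), hit["document"])
```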
1039
+
1040
+
1041
+ def reranker_tokenize_preproc(
1042
+ query: str,
1043
+ passages: List[str],
1044
+ tokenizer=None,
1045
+ max_length: int = 1024,
1046
+ max_query_length: int = 512,
1047
+ overlap_tokens: int = 80,
1048
+ ):
1049
+ from copy import deepcopy
1050
+
1051
+ assert tokenizer is not None, "Please provide a valid tokenizer for tokenization!"
1052
+ sep_id = tokenizer.sep_token_id
1053
+
1054
+ def _merge_inputs(chunk1_raw, chunk2):
1055
+ chunk1 = deepcopy(chunk1_raw)
1056
+ chunk1['input_ids'].append(sep_id)
1057
+ chunk1['input_ids'].extend(chunk2['input_ids'])
1058
+ chunk1['input_ids'].append(sep_id)
1059
+ chunk1['attention_mask'].append(chunk2['attention_mask'][0])
1060
+ chunk1['attention_mask'].extend(chunk2['attention_mask'])
1061
+ chunk1['attention_mask'].append(chunk2['attention_mask'][-1])
1062
+ if 'token_type_ids' in chunk1:
1063
+ token_type_ids = [1 for _ in range(len(chunk2['token_type_ids']) + 2)]
1064
+ chunk1['token_type_ids'].extend(token_type_ids)
1065
+ return chunk1
1066
+
1067
+ # Note: long queries are truncated to `max_query_length` tokens (512 by default here)
1068
+ query_inputs = tokenizer.encode_plus(
1069
+ query, truncation=True, padding=False, max_length=max_query_length
1070
+ )
1071
+
1072
+ max_passage_inputs_length = max_length - len(query_inputs['input_ids']) - 2
1073
+ # assert (
1074
+ # max_passage_inputs_length > 100
1075
+ # ), "Your query is too long! Please make sure your query less than 500 tokens!"
1076
+
1077
+ overlap_tokens_implt = min(overlap_tokens, max_passage_inputs_length // 4)
1078
+
1079
+ res_merge_inputs = []
1080
+ res_merge_inputs_pids = []
1081
+ for pid, passage in enumerate(passages):
1082
+ passage_inputs = tokenizer.encode_plus(
1083
+ passage,
1084
+ truncation=False,
1085
+ padding=False,
1086
+ add_special_tokens=False,
1087
+ max_length=0,
1088
+ )
1089
+ passage_inputs_length = len(passage_inputs['input_ids'])
1090
+
1091
+ if passage_inputs_length <= max_passage_inputs_length:
1092
+ qp_merge_inputs = _merge_inputs(query_inputs, passage_inputs)
1093
+ res_merge_inputs.append(qp_merge_inputs)
1094
+ res_merge_inputs_pids.append(pid)
1095
+ else:
1096
+ start_id = 0
1097
+ while start_id < passage_inputs_length:
1098
+ end_id = start_id + max_passage_inputs_length
1099
+ # make sure the length of the last chunk is `max_passage_inputs_length`
1100
+ if end_id >= passage_inputs_length:
1101
+ sub_passage_inputs = {
1102
+ k: v[-max_passage_inputs_length:]
1103
+ for k, v in passage_inputs.items()
1104
+ }
1105
+ else:
1106
+ sub_passage_inputs = {
1107
+ k: v[start_id:end_id] for k, v in passage_inputs.items()
1108
+ }
1109
+ start_id = (
1110
+ end_id - overlap_tokens_implt
1111
+ if end_id < passage_inputs_length
1112
+ else end_id
1113
+ )
1114
+
1115
+ qp_merge_inputs = _merge_inputs(query_inputs, sub_passage_inputs)
1116
+ res_merge_inputs.append(qp_merge_inputs)
1117
+ res_merge_inputs_pids.append(pid)
1118
+
1119
+ return res_merge_inputs, res_merge_inputs_pids
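
The chunking loop in `reranker_tokenize_preproc` advances by `max_passage_inputs_length - overlap_tokens` each step and pins the final chunk to the tail of the passage; a standalone sketch of the same window arithmetic (no tokenizer required, names are illustrative):

```python
def window_bounds(passage_len: int, chunk_len: int, overlap: int):
    # Mirrors the loop above: fixed-size overlapping windows, with the last
    # window taken from the end of the passage so it is always full length.
    bounds, start = [], 0
    while start < passage_len:
        end = start + chunk_len
        if end >= passage_len:
            bounds.append((max(passage_len - chunk_len, 0), passage_len))
            break
        bounds.append((start, end))
        start = end - overlap
    return bounds

print(window_bounds(passage_len=1000, chunk_len=400, overlap=80))
# [(0, 400), (320, 720), (600, 1000)]
```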
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": true,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e802fe5337779428818439760a1e6161ed36ceed72d4ebcbda9c139a2108fc99
3
+ size 17082988
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<pad>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "</s>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "<unk>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "250001": {
36
+ "content": "<mask>",
37
+ "lstrip": true,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "bos_token": "<s>",
45
+ "clean_up_tokenization_spaces": true,
46
+ "cls_token": "<s>",
47
+ "eos_token": "</s>",
48
+ "extra_special_tokens": {},
49
+ "mask_token": "<mask>",
50
+ "model_max_length": 1024,
51
+ "pad_token": "<pad>",
52
+ "sep_token": "</s>",
53
+ "tokenizer_class": "XLMRobertaTokenizerFast",
54
+ "unk_token": "<unk>"
55
+ }
xlm_padding.py ADDED
@@ -0,0 +1,218 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # This implementation was adapted from https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/block.py
2
+ # Commit id: c94cd09744d20f0ac587a351ff6ff2e8ad11ae1b
3
+
4
+ # Previously adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
5
+
6
+ import torch
7
+ import torch.nn.functional as F
8
+ from einops import rearrange, repeat
9
+
10
+
11
+ class IndexFirstAxis(torch.autograd.Function):
12
+ @staticmethod
13
+ def forward(ctx, input, indices):
14
+ ctx.save_for_backward(indices)
15
+ assert input.ndim >= 2
16
+ ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
17
+ second_dim = other_shape.numel()
18
+ # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
19
+ # return input[indices]
20
+ return torch.gather(
21
+ rearrange(input, "b ... -> b (...)"), 0, repeat(indices, "z -> z d", d=second_dim)
22
+ ).reshape(-1, *other_shape)
23
+
24
+ @staticmethod
25
+ def backward(ctx, grad_output):
26
+ (indices,) = ctx.saved_tensors
27
+ assert grad_output.ndim >= 2
28
+ other_shape = grad_output.shape[1:]
29
+ grad_output = rearrange(grad_output, "b ... -> b (...)")
30
+ grad_input = torch.zeros(
31
+ [ctx.first_axis_dim, grad_output.shape[1]],
32
+ device=grad_output.device,
33
+ dtype=grad_output.dtype,
34
+ )
35
+ # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
36
+ # grad_input[indices] = grad_output
37
+ grad_input.scatter_(0, repeat(indices, "z -> z d", d=grad_output.shape[1]), grad_output)
38
+ return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
39
+
40
+
41
+ index_first_axis = IndexFirstAxis.apply
42
+
43
+
44
+ class IndexPutFirstAxis(torch.autograd.Function):
45
+ @staticmethod
46
+ def forward(ctx, values, indices, first_axis_dim):
47
+ ctx.save_for_backward(indices)
48
+ assert indices.ndim == 1
49
+ assert values.ndim >= 2
50
+ output = torch.zeros(
51
+ first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
52
+ )
53
+ # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
54
+ output[indices] = values
55
+ # output.scatter_(0, repeat(indices, 'z -> z d', d=values.shape[1]), values)
56
+ return output
57
+
58
+ @staticmethod
59
+ def backward(ctx, grad_output):
60
+ (indices,) = ctx.saved_tensors
61
+ # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
62
+ grad_values = grad_output[indices]
63
+ # grad_values = torch.gather(grad_output, 0, repeat(indices, 'z -> z d', d=grad_output.shape[1]))
64
+ return grad_values, None, None
65
+
66
+
67
+ index_put_first_axis = IndexPutFirstAxis.apply
68
+
69
+
70
+ class IndexFirstAxisResidual(torch.autograd.Function):
71
+ @staticmethod
72
+ def forward(ctx, input, indices):
73
+ ctx.save_for_backward(indices)
74
+ assert input.ndim >= 2
75
+ ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
76
+ second_dim = other_shape.numel()
77
+ # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
78
+ output = input[indices]
79
+ # We don't want to reshape input (b ... -> b (...)) since it could change the channel_last
80
+ # memory format to channel_first. In other words, input might not be contiguous.
81
+ # If we don't detach, PyTorch complains that output is a view being modified in place
82
+ return output, input.detach()
83
+
84
+ @staticmethod
85
+ def backward(ctx, grad_output, grad_residual):
86
+ (indices,) = ctx.saved_tensors
87
+ assert grad_output.ndim >= 2
88
+ other_shape = grad_output.shape[1:]
89
+ assert grad_residual.shape[1:] == other_shape
90
+ grad_input = grad_residual
91
+ # grad_input[indices] += grad_output
92
+ indices = indices.reshape(indices.shape[0], *((1,) * (grad_output.ndim - 1)))
93
+ indices = indices.expand_as(grad_output)
94
+ grad_input.scatter_add_(0, indices, grad_output)
95
+ return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
96
+
97
+
98
+ index_first_axis_residual = IndexFirstAxisResidual.apply
99
+
100
+
101
+ def unpad_input(hidden_states, attention_mask):
102
+ """
103
+ Arguments:
104
+ hidden_states: (batch, seqlen, ...)
105
+ attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
106
+ Return:
107
+ hidden_states: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
108
+ indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
109
+ cu_seqlens: (batch + 1), the cumulative sequence lengths, used to index into hidden_states.
110
+ max_seqlen_in_batch: int
111
+ """
112
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
113
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
114
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
115
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
116
+ # TD [2022-03-04] We don't want to index with a bool mask, because Pytorch will expand the
117
+ # bool mask, then call nonzero to get the indices, then index with those. The indices is @dim
118
+ # times larger than it needs to be, wasting memory. It's faster and more memory-efficient to
119
+ # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
120
+ # so we write custom forward and backward to make it a bit faster.
121
+ return (
122
+ index_first_axis(rearrange(hidden_states, "b s ... -> (b s) ..."), indices),
123
+ indices,
124
+ cu_seqlens,
125
+ max_seqlen_in_batch,
126
+ )
127
+
128
+
129
+ def unpad_input_for_concatenated_sequences(hidden_states, attention_mask_in_length):
130
+ """
131
+ Supports concatenating short samples into one sequence. The attention_mask_in_length is used to mask out the other short samples, which enables efficient training on variable-length samples (e.g., supervised fine-tuning of large language models).
132
+ The motivation for this function is explained [here](https://github.com/Dao-AILab/flash-attention/issues/432#issuecomment-1668822286).
133
+
134
+ For example, if batch = 3 and seqlen = 6, the attention_mask_in_length is:
135
+ ```
136
+ [
137
+ [2, 3, 0, 0, 0, 0],
138
+ [3, 2, 0, 0, 0, 0],
139
+ [6, 0, 0, 0, 0, 0]
140
+ ]
141
+ ```
142
+ , which refers to the 3D-attention mask:
143
+ ```
144
+ [
145
+ [
146
+ [1, 0, 0, 0, 0, 0],
147
+ [1, 1, 0, 0, 0, 0],
148
+ [0, 0, 1, 0, 0, 0],
149
+ [0, 0, 1, 1, 0, 0],
150
+ [0, 0, 1, 1, 1, 0],
151
+ [0, 0, 0, 0, 0, 1]
152
+ ],
153
+ [
154
+ [1, 0, 0, 0, 0, 0],
155
+ [1, 1, 0, 0, 0, 0],
156
+ [1, 1, 1, 0, 0, 0],
157
+ [0, 0, 0, 1, 0, 0],
158
+ [0, 0, 0, 1, 1, 0],
159
+ [0, 0, 0, 0, 0, 1]
160
+ ],
161
+ [
162
+ [1, 0, 0, 0, 0, 0],
163
+ [1, 1, 0, 0, 0, 0],
164
+ [1, 1, 1, 0, 0, 0],
165
+ [1, 1, 1, 1, 0, 0],
166
+ [1, 1, 1, 1, 1, 0],
167
+ [1, 1, 1, 1, 1, 1]
168
+ ]
169
+ ]
170
+ ```.
171
+
172
+ Arguments:
173
+ hidden_states: (batch, seqlen, ...)
174
+ attention_mask_in_length: (batch, seqlen), int, a nonzero number (e.g., 1, 2, 3, etc.) means length of concatenated sequence in b-th batch, and 0 means none.
175
+ Return:
176
+ hidden_states: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
177
+ indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
178
+ cu_seqlens: (batch + 1), the cumulative sequence lengths, used to index into hidden_states.
179
+ max_seqlen_in_batch: int
180
+ """
181
+ length = attention_mask_in_length.sum(dim=-1)
182
+ seqlen = attention_mask_in_length.size(-1)
183
+ attention_mask_2d = torch.arange(
184
+ seqlen, device=length.device, dtype=length.dtype
185
+ ).expand(len(length), seqlen) < length.unsqueeze(1)
186
+ real_indices_idx = torch.nonzero(attention_mask_in_length.flatten(), as_tuple=False).flatten()
187
+ seqlens_in_batch = attention_mask_in_length.flatten()[real_indices_idx]
188
+ indices = torch.nonzero(attention_mask_2d.flatten(), as_tuple=False).flatten()
189
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
190
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
191
+ # TD [2022-03-04] We don't want to index with a bool mask, because Pytorch will expand the
192
+ # bool mask, then call nonzero to get the indices, then index with those. The indices is @dim
193
+ # times larger than it needs to be, wasting memory. It's faster and more memory-efficient to
194
+ # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
195
+ # so we write custom forward and backward to make it a bit faster.
196
+ return (
197
+ index_first_axis(rearrange(hidden_states, "b s ... -> (b s) ..."), indices),
198
+ indices,
199
+ cu_seqlens,
200
+ max_seqlen_in_batch,
201
+ )
202
+
203
+
204
+ def pad_input(hidden_states, indices, batch, seqlen):
205
+ """
206
+ Arguments:
207
+ hidden_states: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
208
+ indices: (total_nnz), the indices that represent the non-masked tokens of the original padded input sequence.
209
+ batch: int, batch size for the padded sequence.
210
+ seqlen: int, maximum sequence length for the padded sequence.
211
+ Return:
212
+ hidden_states: (batch, seqlen, ...)
213
+ """
214
+ dim = hidden_states.shape[-1]
215
+ # output = torch.zeros((batch * seqlen), dim, device=hidden_states.device, dtype=hidden_states.dtype)
216
+ # output[indices] = hidden_states
217
+ output = index_put_first_axis(hidden_states, indices, batch * seqlen)
218
+ return rearrange(output, "(b s) ... -> b s ...", b=batch)