Commit 34dbd4e7 authored by Mouhamadou Ba

add infos

parent 2d8b7bc5
@@ -6,7 +6,7 @@ The workflow includes six text-mining pipelines using [AlvisNLP](https://bibliom
The followings steps are provided to run the workflow on the Migale facility (SGE cluster / Linux OS (Ubuntu)). You must know how to use [AlviNLP](https://bibliome.github.io/alvisnlp/), [Snakemake](https://snakemake.readthedocs.io), and the [SGE queuing system](http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html).
> **_NOTE:_** Adaptations may be needed for other environments. See additional documentation [here](docs/README.md).
> **_NOTE:_** Adaptations are required for other environments. See additional documentation [here](docs/README.md).
## Install
@@ -78,7 +78,7 @@ ALVISIR_HOME : "softwares/alvisir-install"
### **4.** add supertaxo
You need to install and run the [taxonomy pipelines](https://forgemia.inra.fr/omnicrobe/extended-microorganisms-taxonomy), then after copy the following result files in folder `ancillaries/extended-microorganisms-taxonomy/`
You need to install and run the [taxonomy pipelines](https://forgemia.inra.fr/omnicrobe/extended-microorganisms-taxonomy) to create the extented taxo, then copy the following result files in folder `ancillaries/extended-microorganisms-taxonomy/`
```
extended-microorganisms-taxonomy/output/bacdive-match \
@@ -103,8 +103,7 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
--restart-times 4 all
```
> **_NOTE:_** The corpus splitted into batches is stored into `corpora/pubmed/`, its is required for next steps
> **_NOTE:_** The corpus splitted into batches `corpora/pubmed/batches`, its is required for next steps
> **_NOTE:_** Execution time ~ 7 hours
### **step 2.** ` Get EPMC FullTexts`
@@ -116,10 +115,8 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
--restart-times 4 all
```
> **_NOTE:_** The corpus splitted into batches is stored into `corpora/epmc/`, its is required for next steps
> **_NOTE:_** The corpus splitted into batches `corpora/epmc/batches`, its is required for next steps
> **_NOTE:_** .snakemake/log/2024-10-22T103433.420025.snakemake.log
> **_NOTE:_** Execution time ~ 7 hours
@@ -133,7 +130,7 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
```
> **_NOTE:_** check snakefile for outputs, they are required for next steps
>**_NOTE:_** execution time ~ 40 mn
> **_NOTE:_** execution time ~ 40 mn
### **step 3.1** `process Pubmed Data` <!--to extracts microorganisms, habitats of texts from Pubmed. -->
@@ -144,6 +141,9 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
--restart-times 4 all
```
> **_NOTE:_** results: `corpora/florilege/pubmed/PubMed-*.txt`
> **_NOTE:_** Execution time ~ ?
### **step 3.2** `process CIRM data` <!--to extract microorganisms, habitats of texts from CIRM. -->
```
@@ -153,6 +153,9 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
--restart-times 4 all
```
> **_NOTE:_** results: `corpora/florilege/cirm/cirm-*-results.txt`
> **_NOTE:_** Execution time ~ ?
### **step 3.3** `process GenBank data` <!--to extract microorganisms, habitats of texts from GenBank.-->
```
@@ -162,6 +165,9 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
--restart-times 4 all
```
> **_NOTE:_** results: `corpora/florilege/genbank/genbank-results.txt`
> **_NOTE:_** Execution time ~ ?
### **step 3.4.** `process DSMZ data` <!--to extract microorganisms, habitats of texts from DSMZ. -->
```
@@ -171,6 +177,9 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
--restart-times 4 all
```
> **_NOTE:_** results: `corpora/florilege/dsmz/dsmz-results.txt`
> **_NOTE:_** Execution time ~ ?
### Run eval process
### **step 3.4.** `evaluate with BioNLP-OST` <!--to extract microorganisms, habitats of texts from DSMZ. -->
@@ -181,5 +190,6 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
--cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** the scores are available here `corpora/florilege/eval/new/BB19-*-eval.json`
> **_NOTE:_** scores: `corpora/florilege/eval/new/BB19-*-eval.json`
> **_NOTE:_** execution time ~ 15 mn