add infos

34dbd4e7 · Mouhamadou Ba · 2d8b7bc5 · 34dbd4e7
Commit 34dbd4e7 authored 4 months ago by Mouhamadou Ba
--- a/README_.md
+++ b/README_.md
@@ -6,7 +6,7 @@ The workflow includes six text-mining pipelines using [AlvisNLP](https://bibliom

 The followings steps are provided to run the workflow on the Migale facility (SGE cluster / Linux OS (Ubuntu)). You must know how to use [AlviNLP](https://bibliome.github.io/alvisnlp/), [Snakemake](https://snakemake.readthedocs.io), and the [SGE queuing system](http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html).

-> **_NOTE:_** Adaptations may be needed for other environments. See additional documentation [here](docs/README.md).
+> **_NOTE:_** Adaptations are required for other environments. See additional documentation [here](docs/README.md).

 ## Install

@@ -78,7 +78,7 @@ ALVISIR_HOME : "softwares/alvisir-install"

 ### **4.** add supertaxo 

-You need to install and run the [taxonomy pipelines](https://forgemia.inra.fr/omnicrobe/extended-microorganisms-taxonomy), then after copy the following result files in folder `ancillaries/extended-microorganisms-taxonomy/`
+You need to install and run the [taxonomy pipelines](https://forgemia.inra.fr/omnicrobe/extended-microorganisms-taxonomy) to create the extended taxonomy, then copy the following result files into the folder `ancillaries/extended-microorganisms-taxonomy/`

 ```
 extended-microorganisms-taxonomy/output/bacdive-match \
@@ -103,8 +103,7 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
 --restart-times 4 all
 ```

-> **_NOTE:_**  The corpus splitted into batches is stored into `corpora/pubmed/`, its is required for next steps
-
+> **_NOTE:_**  The corpus, split into batches, is stored in `corpora/pubmed/batches`; it is required for the next steps
 > **_NOTE:_** Execution time ~ 7 hours 

 ### **step 2.** ` Get EPMC FullTexts` 
@@ -116,10 +115,8 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
 --restart-times 4 all
 ```

-> **_NOTE:_** The corpus splitted into batches is stored into `corpora/epmc/`, its is required for next steps
-
+> **_NOTE:_** The corpus, split into batches, is stored in `corpora/epmc/batches`; it is required for the next steps
 > **_NOTE:_** .snakemake/log/2024-10-22T103433.420025.snakemake.log
-
 > **_NOTE:_** Execution time ~ 7 hours 


@@ -133,7 +130,7 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
 ```

 > **_NOTE:_**  check snakefile for outputs, they are required for next steps
->**_NOTE:_**  execution time ~ 40 mn
+> **_NOTE:_**  execution time ~ 40 mn

 ### **step 3.1** `process Pubmed Data` <!--to extracts microorganisms, habitats of texts from Pubmed. -->

@@ -144,6 +141,9 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
 --restart-times 4 all
 ```

+> **_NOTE:_**  results: `corpora/florilege/pubmed/PubMed-*.txt`
+> **_NOTE:_** Execution time ~ ?
+
 ### **step 3.2** `process CIRM data` <!--to extract microorganisms, habitats of texts from CIRM. -->

 ```
@@ -153,6 +153,9 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
 --restart-times 4 all
 ```

+> **_NOTE:_**  results: `corpora/florilege/cirm/cirm-*-results.txt`
+> **_NOTE:_** Execution time ~ ?
+
 ### **step 3.3** `process GenBank data` <!--to extract microorganisms, habitats of texts from GenBank.-->

 ```
@@ -162,6 +165,9 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
 --restart-times 4 all
 ```

+> **_NOTE:_**  results: `corpora/florilege/genbank/genbank-results.txt`
+> **_NOTE:_** Execution time ~ ?
+
 ### **step 3.4.** `process DSMZ data` <!--to extract microorganisms, habitats of texts from DSMZ. -->

 ```
@@ -171,6 +177,9 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
 --restart-times 4 all
 ```

+> **_NOTE:_**  results: `corpora/florilege/dsmz/dsmz-results.txt`
+> **_NOTE:_** Execution time ~ ?
+
 ### Run eval process

 ### **step 3.4.** `evaluate with BioNLP-OST` <!--to extract microorganisms, habitats of texts from DSMZ. -->
@@ -181,5 +190,6 @@ snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --re
 --cluster "qsub -v PYTHONPATH=''  -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
 --restart-times 4 all
 ```
-> **_NOTE:_** the scores are available here `corpora/florilege/eval/new/BB19-*-eval.json`
+
+> **_NOTE:_** scores: `corpora/florilege/eval/new/BB19-*-eval.json`
 > **_NOTE:_** execution time ~ 15 mn