Try extract only the field text from a JSON Lines file called example-english-corpus.jsonl:
Dataset Card for MIRACL Corpus
{
"docid": "39#0",
"title": "Albedo",
"text": "Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation)."
}
$ jq -r '.text' example-english-corpus.jsonl
Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation).
So, the question of the day is: save output as txt or jsonl? NLTK, spaCy, Gensim, etc., suppose a txt input.