El genio lingüístico: November 2023

Wednesday, November 22, 2023

jq example: parse JSON

Let's try to extract only the text field from a JSON Lines file called example-english-corpus.jsonl. A sample record:

{
    "docid": "39#0",
    "title": "Albedo",
    "text": "Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation)."
}

$ jq -r '.text' example-english-corpus.jsonl
Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation).

So the question of the day is: save the output as txt or as jsonl? NLTK, spaCy, Gensim, etc. generally assume plain-text input.
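
If the answer turns out to be txt, here is a minimal Python sketch of the same extraction, written to a file instead of the terminal. It assumes the corpus holds one JSON object per line; the function name jsonl_field_to_txt and the output file name are made up for illustration. Unlike jq -r, it collapses line breaks inside a field, so each document stays on a single line of the output.

```python
import json


def jsonl_field_to_txt(src, dst, field="text"):
    """Write the value of `field` from each JSON Lines record to `dst`,
    one record per output line. Returns the number of lines written.

    Hypothetical helper: assumes one JSON object per input line.
    """
    count = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            value = record.get(field)
            if value is None:
                continue  # skip records that lack the field
            # Collapse internal whitespace (including newlines) so that
            # every document occupies exactly one line in the output.
            fout.write(" ".join(str(value).split()) + "\n")
            count += 1
    return count
```

Calling jsonl_field_to_txt("example-english-corpus.jsonl", "example-english-corpus.txt") would then leave a plain-text file with one document per line, and the original jsonl is still there if the docid and title are needed later.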

Monday, November 20, 2023

duck in Nahuatl

Cuac. There is a highway that gets you to Tuxtla Gutiérrez, Chiapas, in no time. It is called Cosoleacaque-Cuacnopalan. It must cross the Sierra Madre Oriental, I don't know. I have never even been there.

Upon my word, nature gave primitive man an idea of how to name things. It wasn't Adam; it was the other way around, through onomatopoeia. Cuac!