The AI Commons Project is a proof of concept of a new methodology for developing Artificial Intelligence solutions that allows anyone, anywhere to benefit from the possibilities that AI can provide. The project aims to improve the accessibility, reproducibility, contextualization and enhancement of Artificial Intelligence solutions globally, and especially in emerging markets.
The project aims to demonstrate how a global community of AI experts can learn and co-create mutually beneficial solutions, with the opportunity for cross-country incremental enhancement.
Data Science Nigeria
Wuraola Oyewusi, a pharmacist and data scientist (Data Science Nigeria)
Conversations on social media are known to be casual, informal and open across several topics, including medical topics. Even though social media platforms are good sources of information, their informal nature and the use of pseudonyms have made information extraction difficult for medical applications.
This problem is faced by everyone building a solution that requires medical (clinical) data from social media.
This solution extracts medical information (entities) such as symptoms, diseases, drugs, organisms and other medical-related entities from social media text, which can be used for NLP applications such as information extraction, summarization and data mining.
The output of the solution is a highlighted view or list of all health-related words in a given text, together with the class of information (entity) each one belongs to.
The solution was trained to identify 14 entities namely:
PERSON (Any Human)
SYMPTOM (Symptom of any disease)
MEDICAL FIELD (Medical speciality)
DRUG (Medicinal product)
FOOD (Edible and source of Nutrients)
DOSAGE (Dosage of Medication)
BODY PART (Part of the body)
PLACE (Location, Town, City)
MEDICAL PROCEDURE (Medical Procedure and processes)
DISEASE (Illnesses)
ORGANISM (Causative organism or disease vector)
INJURY (Breakage in skin continuity)
PHYSIOLOGIC PROCESS (Biological Processes)
ADVERSE REACTION (Unintended consequences of medication or food).
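As a minimal sketch, the 14 classes above and the "list of words with their class" output could be represented as follows. The exact label strings stored in the trained model are an assumption (only the class names come from this document), and `group_by_label` and the example pairs are hypothetical, invented for illustration.

```python
from collections import defaultdict

# The 14 entity classes listed above. The exact label strings used inside the
# trained model are an assumption; only the class names come from this document.
ENTITY_LABELS = (
    "PERSON", "SYMPTOM", "MEDICAL FIELD", "DRUG", "FOOD", "DOSAGE",
    "BODY PART", "PLACE", "MEDICAL PROCEDURE", "DISEASE", "ORGANISM",
    "INJURY", "PHYSIOLOGIC PROCESS", "ADVERSE REACTION",
)


def group_by_label(entities):
    """Group (word, label) pairs by entity class, rejecting unknown labels."""
    grouped = defaultdict(list)
    for word, label in entities:
        if label not in ENTITY_LABELS:
            raise ValueError(f"unknown entity label: {label!r}")
        grouped[label].append(word)
    return dict(grouped)


# Hypothetical extraction result for an invented sentence.
example = [("malaria", "DISEASE"), ("headache", "SYMPTOM"), ("fever", "SYMPTOM")]
print(group_by_label(example))
```

Grouping by class is only one convenient view of the output; a flat list of (word, label) pairs carries the same information.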
Health practitioners
Data scientists/machine learning engineers
Social media users.
Machine learning, with emphasis on Natural Language Processing (NLP), and medical health knowledge.
N/A.
Increasing the number of labeled entities
Enriching the dataset with more diverse data sources and formats.
The intended use is to extract medical information from social media text.
Builders of AI solutions in health.
A user feeds the model a text, and the text is returned on the screen with all the medical-related entities highlighted and labelled with their entity class (such as Person, Symptom, Drug etc.).
The solution can be made to read a user’s incoming text automatically and return an appropriate notification.
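The highlighted output described above can be sketched in plain Python. This is an illustration of the idea only, not the project's rendering code: `highlight` is a hypothetical helper, the bracket notation is an arbitrary choice, and the example text and offsets are invented.

```python
def highlight(text, spans):
    """Return text with each (start, end, label) character span marked inline.

    Spans are assumed to be non-overlapping character offsets into text.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])                # untouched text before the entity
        pieces.append(f"[{text[start:end]}]({label})")   # the entity with its class
        cursor = end
    pieces.append(text[cursor:])                         # remaining text after the last entity
    return "".join(pieces)


# Invented example; the offsets were worked out by hand for illustration.
text = "I took paracetamol for my headache"
spans = [(7, 18, "DRUG"), (26, 34, "SYMPTOM")]
print(highlight(text, spans))
```

With a loaded spaCy pipeline, the spans would typically come from `doc.ents` (each entity exposes `start_char`, `end_char` and `label_`), and spaCy's displaCy visualizer can render a comparable highlighted view.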
The solution was developed by a pharmacist.
Data was scraped from an open public forum on the social media platform where general discussions about health and medical conditions are shared.
The dataset was created mainly for this project, but it can be extended and used for problems of similar scope.
The solution implementer.
The dataset contains texts related to health topics.
A total of 12029 text documents were used for training and evaluation.
Each instance is an unprocessed text file.
Yes, the dataset is self-contained. Though it was scraped online, it doesn’t rely on the external sources from which it was obtained in order to be used. The collected dataset is constant, as it is captured with reference to the date and time it was scraped. There are no restrictions such as licenses or fees of any kind associated with any of the external resources.
No.
The data was scraped from the social media platform in the form of raw text.
Yes, it was randomly sampled from the non-exhaustive data available on the internet.
The solution implementer
Yes, tokenization was done and the health-related entities were labelled using the TagEditor (v1.5) annotation tool. The annotated data was then converted to spaCy gold format.
Yes, click HERE to access.
Yes. Sklearn, spaCy, TagEditor…… are all open source.
Tasks related to the solution.
Data Science Nigeria
A message can be sent by filling the form on https://backup.datasciencenigeria.org/contact-us/
Yes. The dataset will be updated from time to time by Data Science Nigeria. When there is an update, the documentation will be updated as well.
Yes. Any changes made will be updated in the documentation.
Model date: 2019. The model was trained using spaCy, an open-source software library for advanced NLP, with all default training parameters.
The health-related entities in the data scraped from social media were labelled using the TagEditor (v1.5) annotation tool. The annotated data was converted to spaCy gold format, then the data format was confirmed using the spaCy command line debugger “!python -m spacy debug-data en”
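Before converting annotations, it can help to run a few basic sanity checks on them. The sketch below is a hand-rolled illustration and does not reproduce what spaCy's `debug-data` command does. It assumes the annotations are available as (start, end, label) character offsets, which is an assumption about the intermediate representation; `check_annotations` and the example data are hypothetical.

```python
def check_annotations(text, entities, allowed_labels):
    """Return a list of problems found in (start, end, label) annotations."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities):
        # Offsets must describe a non-empty slice inside the text.
        if not (0 <= start < end <= len(text)):
            problems.append(f"span ({start}, {end}) is out of bounds")
            continue
        # Entity spans are expected not to overlap one another.
        if start < previous_end:
            problems.append(f"span ({start}, {end}) overlaps an earlier span")
        if label not in allowed_labels:
            problems.append(f"unknown label {label!r}")
        # Leading or trailing whitespace usually signals a misaligned span.
        if text[start:end] != text[start:end].strip():
            problems.append(f"span ({start}, {end}) has leading/trailing whitespace")
        previous_end = max(previous_end, end)
    return problems


# Invented example text and offsets, for illustration only.
text = "Chloroquine gave me itching"
labels = {"DRUG", "ADVERSE REACTION"}
good = [(0, 11, "DRUG"), (20, 27, "ADVERSE REACTION")]
bad = [(0, 11, "DRUG"), (5, 12, "DRUG"), (20, 40, "SYMPTOM")]
print(check_annotations(text, good, labels))
print(check_annotations(text, bad, labels))
```

Checks of this kind catch misaligned or overlapping spans early, before the data reaches the training step.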
No.
For the implementation of this solution, 10610 training documents and 1419 evaluation documents were used, and the evaluation metrics used are F1 score, Recall and Precision.
All the default training parameters were retained. The model result shows a Precision of 71.711%, Recall of 72.458% and F1 score of 72.082%.
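As a quick consistency check on the figures reported above, the F1 score can be recomputed as the harmonic mean of precision and recall, and the split sizes can be summed. This sketch only redoes arithmetic on the reported numbers; it does not rerun the evaluation, and the variable names are illustrative.

```python
# Document counts as reported in this documentation.
TRAIN_DOCS = 10610
EVAL_DOCS = 1419
total_docs = TRAIN_DOCS + EVAL_DOCS
eval_share = EVAL_DOCS / total_docs

# Reported precision and recall, in percent.
precision = 71.711
recall = 72.458

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"total documents: {total_docs}")
print(f"evaluation share: {eval_share:.1%}")
print(f"F1 from reported precision and recall: {f1:.3f}")
```

The recomputed F1 agrees with the reported 72.082% to within the rounding of the inputs, and the two splits add up to the 12029 documents stated for the dataset.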
30 iterations.
The result was reported using Precision, Recall and F1 score.
Google Colab was used with GPU enabled.
To reproduce the solution:
Download all files from HERE
Run all the cells in the notebook at /Nairaland_NER.ipynb
The solution is not suitable for any application not directly related to it.
We are not collecting usage data.
Copyright © 2020 Data Science Nigeria.