Commit 786ac72d authored by lgehring

Update readme

parent b5ec7e30
@@ -45,8 +45,27 @@
<!-- APPROACH -->
## Approach
We decided to use embeddings to represent the carcinogenesis dataset in an efficient form.
This was done using the PyKeen library, which offers a wide variety of embedding models.
Furthermore, it can be configured with parameters such as the number of epochs or the dimension
of the generated embeddings.
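
A minimal sketch of how such a PyKeen run can be configured; the file path, the model name (a stand-in for \<model_name>), the embedding dimension, the epoch count, and the split are illustrative placeholders rather than our exact setup:

```python
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Load the carcinogenesis triples from a tab-separated (head, relation, tail)
# file and split them for embedding training (file name is a placeholder).
tf = TriplesFactory.from_path("carcinogenesis_triples.tsv")
training, testing = tf.split(0.8)

result = pipeline(
    training=training,
    testing=testing,
    model="TransE",                        # stand-in for <model_name>
    model_kwargs=dict(embedding_dim=50),   # dimension of the generated embedding
    training_kwargs=dict(num_epochs=100),  # number of epochs
    random_seed=42,
)

# Entity embedding matrix as a NumPy array; the exact accessor can differ
# slightly between PyKeen versions.
entity_embeddings = (
    result.model.entity_representations[0](indices=None).detach().cpu().numpy()
)
```
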
To make predictions from these embeddings, we first tried typical machine learning algorithms such as
random forests and logistic regression, as well as nearest-neighbour classifiers such as kNN. In doing so, we encountered
the problem that many of the learning problems have a heavily unbalanced ratio of positive and negative
(included and excluded) instances.
For learning problems with an extremely high proportion of negative (excluded) instances,
the classifiers simply labelled every instance as negative, since they mostly optimize accuracy
rather than the F1 score.
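
The following self-contained sketch illustrates this effect with the baseline classifiers mentioned above; the synthetic data merely stands in for the embedding features of one unbalanced learning problem and is not our actual data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for the embedding features of one very unbalanced learning
# problem: roughly 5% positive (included) and 95% negative (excluded) instances.
X, y = make_classification(
    n_samples=1000, n_features=50, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

for clf in (
    RandomForestClassifier(random_state=42),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Accuracy stays high even if almost everything is predicted as negative,
    # while the F1 score of the positive class collapses.
    print(
        type(clf).__name__,
        "accuracy:", round(accuracy_score(y_test, y_pred), 3),
        "F1:", round(f1_score(y_test, y_pred), 3),
    )
```
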
To overcome this problem, we tried to balance the training data before training. Since undersampling
with a very small number of positive instances leads to a very small training set,
we decided to oversample instead. The oversampling algorithm we used is the SMOTE implementation from the scikit-learn
extension imbalanced-learn (https://github.com/scikit-learn-contrib/imbalanced-learn). In simple terms, SMOTE generates
new synthetic data points for the minority class, each of which lies on the line between two existing data points of that class.
Using this technique together with a linear SVM, we were able to at least slightly mitigate the bias towards the negative class.
With this setup we could achieve F1-scores ranging from \<lower_bound> up to \<higher_bound> for the given test learning problems.
We split the data into training and test sets at a ratio of \<ratio>.
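
A minimal sketch of this oversampling step, again on synthetic stand-in data and with illustrative parameter values: SMOTE is applied to the training split only, followed by a linear SVM.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Same kind of unbalanced synthetic data as above (placeholder for the embedding features).
X, y = make_classification(
    n_samples=1000, n_features=50, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample only the training data with SMOTE, then fit a linear SVM.
# Using imblearn's Pipeline keeps the test data untouched by the oversampling.
model = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("svm", LinearSVC(max_iter=10000)),
])
model.fit(X_train, y_train)
print("F1:", round(f1_score(y_test, model.predict(X_test)), 3))
```
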
<!-- PREREQUISITES -->
@@ -64,7 +83,9 @@ of \<ratio>.
* numpy (1.19.1)
* rdflib (5.0.0)
* scikit_learn (0.24.2)
* imbalanced-learn
* scipy (>=0.19.1)
* joblib (>=0.11)
<!-- GETTING STARTED -->