# Explaining GNN with High Level Concepts

This repository contains the code for the thesis "Explaining Graph Neural Networks using High Level Concepts" at the University of Paderborn. The project detects high-level concepts, uses them to enrich an OWL ontology, and investigates how the length of the learned concepts can be reduced. This README provides instructions for running experiments on structured and text-based datasets using the provided Python script.
## Overview
The script allows for running experiments on two types of datasets: structured graph datasets (such as BA2Motif, MultiShape, and MUTAG) and text-based datasets (such as DBLP and IMDB), as listed in the configuration file.
## Requirements
- Python 3.6 or higher
- OpenAI API key
- Dependencies are listed in the `requirements.txt` file.
## Setup Instructions
### 1. Virtual Environment Setup
First, create and activate a virtual environment:
```bash
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows
venv\Scripts\activate
# On Unix or MacOS
source venv/bin/activate
```
### 2. Install Dependencies
After activating the virtual environment, install the required packages:
```bash
pip install -r requirements.txt
```
Ensure that you have Python and pip installed on your system. If you encounter any issues during the installation, please check that your Python environment is correctly set up.
## Configuration
The script uses a JSON configuration file to determine which datasets are available for experiments. The structure of the configuration file should be as follows:
```json
{
  "structured": [
    {
      "datasetName": "BA2Motif"
    },
    {
      "datasetName": "MultiShape"
    },
    {
      "datasetName": "MUTAG"
    }
  ],
  "text": [
    {
      "datasetName": "dblp",
      "grouped_keyword_dir": "rawData/dblp/groups",
      "entity_name": "author"
    },
    {
      "datasetName": "imdb",
      "grouped_keyword_dir": "rawData/imdb/groups",
      "entity_name": "movie"
    }
  ]
}
```
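For reference, the snippet below is a minimal sketch of how such a configuration could be read and filtered by type. The actual loading logic lives in the experiment script and may differ; the function name here is illustrative.

```python
import json

def load_datasets(config_path: str = "config.json", dataset_type: str = "structured"):
    """Return the dataset entries of the requested type from the config file (illustrative sketch)."""
    with open(config_path) as f:
        config = json.load(f)
    return config.get(dataset_type, [])

# Example output: [{'datasetName': 'BA2Motif'}, {'datasetName': 'MultiShape'}, {'datasetName': 'MUTAG'}]
print(load_datasets(dataset_type="structured"))
```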
### 3. Environment Configuration
Create a `.env` file in the root directory and add your OpenAI API key:
```
OPEN_AI_API_KEY="your-api-key-here"
```
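How the key is consumed is up to the scripts; as a rough sketch, and assuming a loader such as `python-dotenv` (or an equivalent) is used, the key can be read like this:

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads the .env file in the current working directory
api_key = os.getenv("OPEN_AI_API_KEY")
if not api_key:
    raise RuntimeError("OPEN_AI_API_KEY is not set; check your .env file")
```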
### 4. Data Setup
Create a `rawData` folder in the root directory and add the following datasets:
1. For the DBLP dataset:
   - Create a `dblp` folder inside `rawData`
   - Download the data from the [DBLP Dataset](https://github.com/Jhy1993/HAN/tree/master/data/DBLP_four_area)
   - Place all files in the `rawData/dblp` folder
2. For the IMDB dataset:
   - Create an `imdb` folder inside `rawData`
   - Download the data from the [IMDB Dataset](https://github.com/Jhy1993/HAN/tree/master/data/imdb)
   - Place all files in the `rawData/imdb` folder
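As a quick sanity check before running anything, you can verify that the expected folders exist and are not empty (a small illustrative snippet, not part of the repository):

```python
from pathlib import Path

for folder in ("rawData/dblp", "rawData/imdb"):
    path = Path(folder)
    status = "found" if path.is_dir() and any(path.iterdir()) else "missing or empty"
    print(f"{folder}: {status}")
```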
### 5. Run Tests
Before running experiments, verify the setup by running tests:
```bash
pytest -v
```
## How to Run Experiments

1. **Running Experiments**:
   - Ensure your virtual environment is activated
   - Ensure your `config.json` file is set up as described in the Configuration section, or pass its path with `--config`
   - Use the command-line interface to specify the type of dataset and optionally target a specific dataset within that type
   - You can specify various parameters to customize the experiment execution
### Command-Line Arguments
- `-c, --config`: Path to the configuration file. Defaults to `config.json`.
- `-i, --iterations`: Number of times to run the experiment. Must be a positive integer. Defaults to 5.
- `-t, --type`: Type of dataset to run the experiments on. Choices are:
  - `s` or `structured`: for structured datasets
  - `t` or `text`: for text datasets
  - If not specified, experiments are run on both types
- `-d, --dataset`: Specific dataset name to run. Optional.
- `-n, --num_groups`: List of group sizes to run experiments with. Space-separated integers. Optional.
  - Default: `0 5 10 15 20 25`
- `-l, --labels`: List of labels to run experiments with. Space-separated integers. Optional.
- `-b, --boolean_concepts`: Whether to create high-level concepts as boolean values.
  - Must explicitly state `true` or `false`
  - Defaults to `true`
- `-e, --use_experimented_groups`: Flag to use previously experimented groups instead of creating new grouped keywords.
  - No value needed; just include the flag to enable it
- `-p, --penalty`: Set the penalty value for EvoLearner. Defaults to 1.
- `--title`: Title to append to the results folder name. Optional.
  - Converted to lowercase and stripped of non-alphanumeric characters
  - Added to the folder name with an underscore prefix
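The options above map naturally onto `argparse`. The sketch below shows how they could be declared; it is illustrative only, and the real parser in `main.py` may differ in details such as types, defaults, and validation.

```python
import argparse

parser = argparse.ArgumentParser(description="Run GNN explanation experiments")
parser.add_argument("-c", "--config", default="config.json")
parser.add_argument("-t", "--type", choices=["s", "structured", "t", "text"], default=None)
parser.add_argument("-i", "--iterations", type=int, default=5)
parser.add_argument("-d", "--dataset", default=None)
parser.add_argument("-n", "--num_groups", type=int, nargs="+", default=[0, 5, 10, 15, 20, 25])
parser.add_argument("-l", "--labels", type=int, nargs="+", default=None)
parser.add_argument("-b", "--boolean_concepts", choices=["true", "false"], default="true")
parser.add_argument("-e", "--use_experimented_groups", action="store_true")
parser.add_argument("-p", "--penalty", type=float, default=1)
parser.add_argument("--title", default="")

# Example invocation equivalent to: python main.py -t text -d dblp -n 5 10
args = parser.parse_args(["-t", "text", "-d", "dblp", "-n", "5", "10"])
print(args.type, args.dataset, args.num_groups)
```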
### Examples
- **Run All Structured Datasets**:
```bash
python main.py --type structured
```
- **Run a Specific Structured Dataset**:
```bash
python main.py --type structured --dataset BA2Motif
```
- **Run a Text Dataset with Custom Parameters**:
```bash
python main.py --type text --dataset dblp --iterations 10 --num_groups 5 10 15 --boolean_concepts true --penalty 2
```
- **Run Multiple Iterations Only on Certain Labels**:
```bash
python main.py --type text --dataset imdb --iterations 3 --labels 0 1 --title experiment_batch1
```
- **Use Experimented Groups with Text Dataset**:
```bash
python main.py --type text --dataset dblp --use_experimented_groups --penalty 3
```
### Output
The results are saved in a folder whose name is derived from the experiment parameters and any provided title. For example, if you run an experiment with the title "batch1", the results are saved in a folder of the form `{timestamp}_batch1`.
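For illustration, the folder-name handling described above (lowercasing the title, stripping non-alphanumeric characters, and appending it with an underscore) can be sketched as follows. The exact timestamp format used by the script may differ.

```python
import re
from datetime import datetime

def results_folder_name(title: str = "") -> str:
    """Build a results folder name from a timestamp and a sanitized title (illustrative sketch)."""
    name = datetime.now().strftime("%Y%m%d_%H%M%S")
    if title:
        cleaned = re.sub(r"[^a-z0-9]", "", title.lower())
        name += f"_{cleaned}"
    return name

print(results_folder_name("Experiment Batch1"))  # e.g. 20240101_120000_experimentbatch1
```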
---
**Note**: Parts of the DiscriminativeExplainer and ConvertToOWL files were adapted and modified from the [PG-XGNN project](https://git.cs.uni-paderborn.de/pg-xgnn/pg-xgnn).
import functools

import numpy as np
from typing import Set, Tuple, List
from ontolearn.owlapy.model import OWLClassExpression, OWLObjectComplementOf, OWLObjectUnionOf, \
    OWLObjectIntersectionOf, OWLObjectSomeValuesFrom, OWLObjectAllValuesFrom, OWLObjectMaxCardinality, \
    OWLObjectMinCardinality, OWLClass, OWLObjectProperty, OWLDataSomeValuesFrom, OWLObjectOneOf
from torch_geometric.data import HeteroData


class Evaluator:
    """ An evaluator which is able to evaluate the accuracy of a given logical formula based on a given dataset."""

    def __init__(self, data: HeteroData):
        """
        Initializes the evaluator based on the given dataset.
        Args:
            data: The dataset which should be used for evaluation.
        """
        self._data = data
        self._nodeset = self._get_nodeset()
        self.owl_mapping = {
            OWLObjectComplementOf: self._eval_complement,
            OWLObjectUnionOf: self._eval_union,
            OWLObjectIntersectionOf: self._eval_intersection,
            OWLObjectSomeValuesFrom: self._eval_existential,
            OWLObjectAllValuesFrom: self._eval_universal,
            OWLObjectMaxCardinality: self._eval_max_cardinality,
            OWLObjectMinCardinality: self._eval_min_cardinality,
            OWLClass: self._eval_class,
            OWLDataSomeValuesFrom: self._eval_property_value,
            OWLObjectOneOf: self._eval_object_one_of
        }

    @property
    def data(self) -> HeteroData:
        """
        The dataset which should be used for evaluation.
        Returns:
            The dataset which should be used for evaluation.
        """
        return self._data

    @data.setter
    def data(self, val: HeteroData) -> None:
        """
        Sets the dataset which should be used for evaluation to the given value.
        Args:
            val: The dataset which should be used for evaluation.
        """
        self._data = val

    def explanation_accuracy(self, ground_truth: Set[Tuple[int, str]], logical_formula: OWLClassExpression) -> Tuple[float, float, float]:
        """
        Calculates the explanation accuracy of the given logical formula based on the given ground truth.
        Args:
            ground_truth: The ground truth which should be used for evaluation.
            logical_formula: The logical formula which should be evaluated.
        Returns:
            A triple containing the precision, recall, and accuracy of the given logical formula based on the ground truth.
        """
        tp, fp, tn, fn = self._get_positive_negatives(ground_truth, logical_formula)
        if tp + fp == 0:
            return 0, tp / (tp + fn), (tp + tn) / (tp + fp + tn + fn)
        return tp / (tp + fp), tp / (tp + fn), (tp + tn) / (tp + fp + tn + fn)

    def f1_score(self, ground_truth: Set[Tuple[int, str]], logical_formula: OWLClassExpression) -> float:
        """
        Calculates the F1 score of the given logical formula based on the given ground truth.
        Args:
            ground_truth: The ground truth which should be used for evaluation.
            logical_formula: The logical formula which should be evaluated.
        Returns:
            The F1 score of the given logical formula based on the ground truth.
        """
        tp, fp, _, fn = self._get_positive_negatives(ground_truth, logical_formula)
        return (2 * tp) / (2 * tp + fp + fn)

    def _get_positive_negatives(self, ground_truth: Set[Tuple[int, str]], logical_formula: OWLClassExpression) -> Tuple[float, float, float, float]:
        """
        Calculates the sizes of the true positives, false positives, true negatives, and false negatives of the given logical formula.
        Args:
            ground_truth: The ground truth which should be used for evaluation.
            logical_formula: The logical formula which should be evaluated.
        Returns:
            A tuple containing the sizes of the true positives, false positives, true negatives, and false negatives.
        """
        explanation_set = self._eval_formula(logical_formula)
        true_positives = len(explanation_set & ground_truth)
        false_positives = len(explanation_set - ground_truth)
        false_negatives = len(ground_truth - explanation_set)
        true_negatives = self.data.num_nodes - true_positives - false_positives - false_negatives
        return true_positives, false_positives, true_negatives, false_negatives

    @functools.lru_cache(maxsize=100)
    def _eval_formula(self, logical_formula: OWLClassExpression) -> Set[Tuple[int, str]]:
        """
        Evaluates the given logical formula based on the given dataset and returns the set of matching nodes.
        Args:
            logical_formula: The logical formula which should be evaluated.
        Returns:
            A set of nodes which are the result of the evaluation.
        """
        return self.owl_mapping[type(logical_formula)](logical_formula)

    def _eval_complement(self, logical_formula: OWLObjectComplementOf) -> Set[Tuple[int, str]]:
        """
        Evaluates the given complement based on the given dataset and returns the set of matching nodes.
        Args:
            logical_formula: The complement which should be evaluated.
        Returns:
            A set of nodes which are the result of the evaluation.
        """
        inner_set = self._eval_formula(logical_formula.get_operand())
        return self._nodeset - inner_set

    def _eval_union(self, logical_formula: OWLObjectUnionOf) -> Set[Tuple[int, str]]:
        """
        Evaluates the given union based on the given dataset and returns the set of matching nodes.
        Args:
            logical_formula: The union which should be evaluated.
        Returns:
            A set of nodes which are the result of the evaluation.
        """
        operands = list(logical_formula.operands())
        result = set()
        for i in operands:
            result = result | self._eval_formula(i)
        return result

    def _eval_intersection(self, logical_formula: OWLObjectIntersectionOf) -> Set[Tuple[int, str]]:
        """
        Evaluates the given intersection based on the given dataset and returns the set of matching nodes.
        Args:
            logical_formula: The intersection which should be evaluated.
        Returns:
            A set of nodes which are the result of the evaluation.
        """
        operands = list(logical_formula.operands())
        result = self._eval_formula(operands[0])
        for i in operands[1:]:
            result = result & self._eval_formula(i)
        return result

    def _eval_existential(self, logical_formula: OWLObjectSomeValuesFrom) -> Set[Tuple[int, str]]:
        """
        Evaluates the given existential restriction based on the given dataset and returns the set of matching nodes.
        Args:
            logical_formula: The existential restriction which should be evaluated.
        Returns:
            A set of nodes which are the result of the evaluation.
        """
        dest = self._eval_formula(logical_formula.get_filler())
        edge_type = self._eval_property(logical_formula.get_property())
        dest_first_elements = np.array([b[0] for b in dest])
        selection = np.isin(self.data[edge_type]['edge_index'][1].cpu(), dest_first_elements)
        origin = self.data[edge_type]['edge_index'][0][selection].cpu().numpy()
        return set(zip(origin, [edge_type[0], ] * len(origin)))

    def _eval_object_one_of(self, logical_formula: OWLObjectOneOf) -> Set[Tuple[int, str]]:
        """
        Evaluate an OWL ObjectOneOf logical formula and return a set of tuples representing nodes that match the condition.
        Args:
            logical_formula: The OWL ObjectOneOf logical formula to evaluate.
        Returns:
            A set of tuples where each tuple represents a node that matches the condition.
            Each tuple contains two elements: an integer representing the index and a string representing the node type.
        """
        nodes = set()
        individuals = list(logical_formula.individuals())
        for individual in individuals:
            node_type, index = individual.get_iri().get_remainder().split('#')
            nodes.add((int(index), node_type))
        return nodes

    def _eval_universal(self, logical_formula: OWLObjectAllValuesFrom) -> Set[Tuple[int, str]]:
        """
        Evaluates the given universal restriction based on the given dataset and returns the set of matching nodes.
        Args:
            logical_formula: The universal restriction which should be evaluated.
        Returns:
            A set of nodes which are the result of the evaluation.
        """
        dest = set(self._eval_formula(logical_formula.get_filler()))
        edge_type = self._eval_property(logical_formula.get_property())
        result = set()
        mapping = dict()
        edge_index_0 = self.data[edge_type]["edge_index"][0].cpu().numpy()
        edge_index_1 = self.data[edge_type]["edge_index"][1].cpu().numpy()
        for i in range(len(edge_index_0)):
            idx_0 = edge_index_0[i].item()
            idx_1 = edge_index_1[i].item()
            if idx_0 not in mapping:
                mapping[idx_0] = [idx_1]
            else:
                mapping[idx_0].append(idx_1)
        for i, indices in mapping.items():
            check_set = {(idx, edge_type[2]) for idx in indices}
            if check_set.issubset(dest):
                result.add((i, edge_type[0]))
        return result

    def _eval_max_cardinality(self, logical_formula: OWLObjectMaxCardinality) -> Set[Tuple[int, str]]:
        """
        Evaluates the given max cardinality restriction based on the given dataset and returns the set of matching nodes.
        Args:
            logical_formula: The max cardinality restriction which should be evaluated.
        Returns:
            A set of nodes which are the result of the evaluation.
        """
        dest = set(self._eval_formula(logical_formula.get_filler()))
        edge_type = self._eval_property(logical_formula.get_property())
        cardinality = logical_formula.get_cardinality()
        result = set()
        mapping = dict()
        edge_index_0 = self.data[edge_type]["edge_index"][0].cpu().numpy()
        edge_index_1 = self.data[edge_type]["edge_index"][1].cpu().numpy()
        for i in range(len(edge_index_0)):
            idx_0 = edge_index_0[i].item()
            idx_1 = edge_index_1[i].item()
            if idx_0 not in mapping:
                mapping[idx_0] = [idx_1]
            else:
                mapping[idx_0].append(idx_1)
        for i, indices in mapping.items():
            check_set = {(idx, edge_type[2]) for idx in indices}
            if len(check_set) <= cardinality and check_set.issubset(dest):
                result.add((i, edge_type[0]))
        return result

    def _eval_min_cardinality(self, logical_formula: OWLObjectMinCardinality) -> Set[Tuple[int, str]]:
        """
        Evaluates the given min cardinality restriction based on the given dataset and returns the set of matching nodes.
        Args:
            logical_formula: The min cardinality restriction which should be evaluated.
        Returns:
            A set of nodes which are the result of the evaluation.
        """
        dest = set(self._eval_formula(logical_formula.get_filler()))
        edge_type = self._eval_property(logical_formula.get_property())
        cardinality = logical_formula.get_cardinality()
        result = set()
        mapping = dict()
        edge_index_0 = self.data[edge_type]["edge_index"][0].cpu().numpy()
        edge_index_1 = self.data[edge_type]["edge_index"][1].cpu().numpy()
        for i in range(len(edge_index_0)):
            idx_0 = edge_index_0[i].item()
            idx_1 = edge_index_1[i].item()
            if idx_0 not in mapping:
                mapping[idx_0] = [idx_1]
            else:
                mapping[idx_0].append(idx_1)
        for i, indices in mapping.items():
            check_set = {(idx, edge_type[2]) for idx in indices}
            if len(check_set) >= cardinality and check_set.issubset(dest):
                result.add((i, edge_type[0]))
        return result

    def _eval_class(self, logical_formula: OWLClass) -> Set[Tuple[int, str]]:
        """
        Evaluates the given class based on the given dataset and returns the set of matching nodes.
        Args:
            logical_formula: The class which should be evaluated.
        Returns:
            A set of nodes which are the result of the evaluation.
        """
        return self._get_nodeset([logical_formula.get_iri().get_remainder(), ])

    def _eval_property_value(self, logical_formula: OWLDataSomeValuesFrom) -> Set[Tuple[int, str]]:
        """
        Evaluates the given OWLDataSomeValuesFrom logical formula based on the dataset and returns the set of nodes
        that satisfy the specified property value condition.
        Args:
            logical_formula: The OWLDataSomeValuesFrom expression representing a property value condition.
        Returns:
            A set of nodes that satisfy the specified property value condition.
            Each tuple contains the node index and node type.
        """
        nodes_matching_condition = set()
        property_iri = logical_formula.get_property().get_iri().get_remainder()
        facet_restriction = logical_formula.get_filler().get_facet_restrictions()[0]
        property_split = property_iri.split('_')
        node_type = property_split[0]
        feature_index = int(property_split[-1]) - 1
        operator = facet_restriction.get_facet().operator
        comparison_value = facet_restriction.get_facet_value()._v
        nodes = self.data[node_type]['x'].cpu().numpy()
        for index, node in enumerate(nodes):
            if operator(node[feature_index], comparison_value):
                nodes_matching_condition.add((index, node_type))
        return nodes_matching_condition

    def _eval_property(self, property: OWLObjectProperty) -> Tuple[str, str, str]:
        """
        Evaluates the given property based on the given dataset and returns the edge type.
        Args:
            property: The property which should be evaluated.
        Returns:
            The edge type which is the result of the evaluation.
        """
        for i in self.data.edge_types:
            if i[1] == property.get_iri().get_remainder():
                return i

    def _get_nodeset(self, node_types: List[str] = None) -> Set[Tuple[int, str]]:
        """
        Returns the set of nodes of the given node types.
        Args:
            node_types: The node types for which the nodes should be returned.
        Returns:
            The set of nodes of the given node types.
        """
        if node_types is None or node_types == ['Thing', ]:
            node_types = self.data.node_types
        if node_types == ['Nothing', ]:
            return set()
        result = set()
        for i in node_types:
            if "x" in self.data[i]:
                result = result | set(enumerate([i] * self.data[i]["x"].shape[0]))
            elif "num_nodes" in self.data[i]:
                result = result | set(enumerate([i] * self.data[i]["num_nodes"]))
        return result
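
# Usage sketch (illustrative only, not part of the original file): the evaluator works on
# (node_index, node_type) tuples, so a ground truth for, e.g., "author" nodes 0 and 3 would be
# {(0, "author"), (3, "author")}. Assuming `hetero_data` is a torch_geometric HeteroData object
# and `class_expression` is an OWLClassExpression produced by the concept learner:
#
#     evaluator = Evaluator(hetero_data)
#     precision, recall, accuracy = evaluator.explanation_accuracy(ground_truth, class_expression)
#     f1 = evaluator.f1_score(ground_truth, class_expression)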
# Excerpt from the experiment driver after this commit (the definition of explain_gnn and other
# parts of the file are elided here and marked with "...").

def experiment(datasetName: str, add_node_type=True, iterations: int = 1,
               create_high_level_concepts_as_boolean=None, selected_labels=None, title="", penalty=1):
    """
    Run the experiment for the specified dataset multiple times.
    ...
    """
    # ...
    model = GNN(structuredDataset.dataset)

    print("Training model...")
    metrics = model.train_model()

    original_labels = np.array([data.y.item() for data in structuredDataset.dataset])
    predicted_labels = model.predict_all()
    # ...
    # Explain GNN before finding motifs
    print("\nBefore finding motifs:")
    explain_gnn(model, structuredDataset.dataset, datasetName, run_dir, add_node_type, penalty=penalty)

    print("\nDetecting motifs...")
    patterns, presence_matrix = structuredDataset.detect_motifs(visualizationPath=run_dir)
    # ...
    if create_high_level_concepts_as_boolean is not False:
        print("\nRunning with boolean concepts:")
        explain_gnn(model, structuredDataset.dataset, datasetName, run_dir, add_node_type,
                    high_level_concepts, create_high_level_concepts_as_boolean=True,
                    selected_labels=selected_labels, penalty=penalty)

    if create_high_level_concepts_as_boolean is not True:
        print("\nRunning with integer concepts:")
        explain_gnn(model, structuredDataset.dataset, datasetName, run_dir, add_node_type,
                    high_level_concepts, create_high_level_concepts_as_boolean=False,
                    selected_labels=selected_labels, penalty=penalty)

    hypothesis_file = run_dir / "gnn_explanations.csv"
    if hypothesis_file.exists():
        # ...
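
# Usage sketch (hypothetical, for illustration): with the new keyword argument, a single
# structured-dataset run with a stronger EvoLearner penalty could be invoked directly as
#
#     experiment("BA2Motif", iterations=3, penalty=2)
#
# The command-line interface documented in the README exposes the same parameters as flags.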