Add Protein GO annotation task for scientific evaluation by suencgo · Pull Request #10 · InternScience/SciEvalKit

suencgo · 2026-03-03T09:59:09Z

Add ProteinGODataset supporting go_bp, go_cc, go_mf (Gene Ontology BP/CC/MF)
Add evaluation utilities with multi-format parsing (GPT, Qwen, etc.)
Add GO term mapping files for BP, CC, MF categories
Register ProteinGODataset in TEXT_DATASET for run.py integration

Made-with: Cursor

- Add ProteinGODataset supporting go_bp, go_cc, go_mf (Gene Ontology BP/CC/MF) - Add evaluation utilities with multi-format parsing (GPT, Qwen, etc.) - Add GO term mapping files for BP, CC, MF categories - Register ProteinGODataset in TEXT_DATASET for run.py integration Made-with: Cursor

gemini-code-assist · 2026-03-03T09:59:29Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new ProteinGODataset into the SciEvalKit, specifically designed for evaluating protein Gene Ontology annotation prediction models. It provides structured data loading for various GO categories and includes versatile evaluation utilities to process and compare predictions from diverse language models, thereby enhancing the platform's capabilities for scientific evaluation in bioinformatics.

Highlights

New Dataset Integration: Added ProteinGODataset to support Gene Ontology (GO) annotation prediction across Biological Process (BP), Cellular Component (CC), and Molecular Function (MF) categories.
Enhanced Evaluation Utilities: Introduced robust evaluation utilities capable of parsing diverse model output formats (e.g., GPT, Qwen) for accurate GO term extraction and comparison.
GO Term Mappings: Included comprehensive mapping files for GO term IDs specific to the BP, CC, and MF branches.
System Integration: Registered the newly added ProteinGODataset within the TEXT_DATASET list, ensuring its seamless integration and availability for run.py.

Changelog

scieval/dataset/Protein_GO/init.py
- Initialized the Protein_GO package, exposing the ProteinGODataset.
scieval/dataset/Protein_GO/mappings/go_bp_mapping.txt
- Added GO term ID mappings for the Biological Process branch.
scieval/dataset/Protein_GO/mappings/go_cc_mapping.txt
- Added GO term ID mappings for the Cellular Component branch.
scieval/dataset/Protein_GO/mappings/go_mf_mapping.txt
- Added GO term ID mappings for the Molecular Function branch.
scieval/dataset/Protein_GO/protein_go.py
- Implemented the ProteinGODataset class, including data loading and exact match evaluation logic for GO annotation prediction.
scieval/dataset/Protein_GO/utils/init.py
- Initialized the utils subpackage, exposing GO evaluation functions.
scieval/dataset/Protein_GO/utils/protein_go.py
- Developed utilities for cleaning, parsing, extracting, normalizing, and evaluating GO term predictions from various model outputs.
scieval/dataset/init.py
- Updated dataset imports to include ProteinGODataset.
- Registered ProteinGODataset in the TEXT_DATASET list for broader system access.

Activity

No human activity has been recorded on this pull request yet.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new ProteinGODataset for evaluating protein Gene Ontology annotation prediction, including its dataset class, evaluation utilities, and GO term mapping files. However, significant security concerns exist regarding the handling of evaluation input files. The use of a generic load function on untrusted file paths could lead to Remote Code Execution (RCE) via insecure deserialization of pickle files and Server-Side Request Forgery (SSRF) via URL-based loading. Additionally, a lack of path sanitization when writing results may allow for arbitrary file writes. Strict validation on input file paths and formats is crucial to mitigate these risks. Beyond security, there are also areas for improvement in utility functions concerning Python best practices, such as avoiding wildcard imports, removing unused code, and using more specific exception handling.

gemini-code-assist · 2026-03-03T10:01:37Z

scieval/dataset/Protein_GO/utils/protein_go.py

+    - Parse prediction to GO terms, compare with answer
+    - Return DataFrame: ['Total', 'Correct', 'Accuracy']
+    """
+    data = load(eval_file)


The evaluate_protein_go function passes the eval_file argument directly to the load utility. The load function (defined in scieval/smp/file.py) is dangerous because it automatically downloads files from URLs (leading to SSRF) and uses pickle.load for files with a .pkl extension (leading to Remote Code Execution). An attacker could provide a malicious URL or a path to a crafted pickle file to compromise the system.

Recommendation: Validate that eval_file is a local file path and that its extension is one of the expected safe formats (e.g., .csv, .tsv, .json, .xlsx). Avoid using load on untrusted paths if it supports unsafe deserialization.

Suggested change

data = load(eval_file)

if not osp.exists(eval_file):

raise FileNotFoundError(f"Eval file {eval_file} not found.")

if not any(eval_file.endswith(ext) for ext in ['.csv', '.tsv', '.json', '.xlsx']):

raise ValueError(f"Unsupported file format for {eval_file}. Please use .csv, .tsv, .json, or .xlsx")

data = load(eval_file)

gemini-code-assist · 2026-03-03T10:01:38Z

scieval/dataset/Protein_GO/utils/protein_go.py

+def _try_extract_json_array(text: str) -> List[str] | None:
+    try:
+        import json
+        m = re.search(r'\[[\s\S]*?\]', text)
+        if m:
+            arr = json.loads(m.group())
+            if isinstance(arr, list) and all(isinstance(x, str) for x in arr):
+                return [x.strip() for x in arr if x.strip()]
+        m = re.search(r'\{[\s\S]*?\}', text)
+        if m:
+            obj = json.loads(m.group())
+            for key in ("annotations", "terms", "go_terms", "answer"):
+                if key in obj and isinstance(obj[key], list):
+                    return [str(x).strip() for x in obj[key] if str(x).strip()]
+    except Exception:
+        pass
+    return None


This function has a couple of areas for improvement:

The import json statement on line 69 should be moved to the top of the file with other imports. This is a standard Python convention (PEP 8) and improves code readability.

Catching a broad Exception on line 81 is risky as it can hide unexpected bugs. It's better to catch more specific exceptions. In this context, json.JSONDecodeError and TypeError would be more appropriate.

Suggested change

def _try_extract_json_array(text: str) -> List[str] | None:

try:

import json

m = re.search(r'\[[\s\S]*?\]', text)

if m:

arr = json.loads(m.group())

if isinstance(arr, list) and all(isinstance(x, str) for x in arr):

return [x.strip() for x in arr if x.strip()]

m = re.search(r'\{[\s\S]*?\}', text)

if m:

obj = json.loads(m.group())

for key in ("annotations", "terms", "go_terms", "answer"):

if key in obj and isinstance(obj[key], list):

return [str(x).strip() for x in obj[key] if str(x).strip()]

except Exception:

pass

return None

def _try_extract_json_array(text: str) -> List[str] | None:

try:

m = re.search(r'\[[\s\S]*?\]', text)

if m:

arr = json.loads(m.group())

if isinstance(arr, list) and all(isinstance(x, str) for x in arr):

return [x.strip() for x in arr if x.strip()]

m = re.search(r'\{[\s\S]*?\}', text)

if m:

obj = json.loads(m.group())

for key in ("annotations", "terms", "go_terms", "answer"):

if key in obj and isinstance(obj[key], list):

return [str(x).strip() for x in obj[key] if str(x).strip()]

except (json.JSONDecodeError, TypeError):

pass

return None

gemini-code-assist · 2026-03-03T10:01:38Z

scieval/dataset/Protein_GO/utils/protein_go.py

+    res = pd.DataFrame({"Total": [total], "Correct": [correct], "Accuracy": [acc * 100.0]})
+    score_file = get_intermediate_file_path(eval_file, "_acc", "csv")
+    dump(res, score_file)


The evaluate_protein_go function constructs a result file path (score_file) by appending a suffix to the user-provided eval_file path and then writes to it using dump. Since eval_file is not sanitized, an attacker can use path traversal sequences (e.g., ../../) or absolute paths to cause the application to write evaluation results to arbitrary locations on the filesystem.

Recommendation: Sanitize the eval_file path to ensure it stays within an expected directory, or use only the basename of the file when constructing the output path.

Suggested change

res = pd.DataFrame({"Total": [total], "Correct": [correct], "Accuracy": [acc * 100.0]})

score_file = get_intermediate_file_path(eval_file, "_acc", "csv")

dump(res, score_file)

res = pd.DataFrame({"Total": [total], "Correct": [correct], "Accuracy": [acc * 100.0]})

# Use basename to prevent path traversal when writing results

output_filename = get_intermediate_file_path(osp.basename(eval_file), "_acc", "csv")

score_file = osp.join(osp.dirname(eval_file), output_filename)

dump(res, score_file)

gemini-code-assist · 2026-03-03T10:01:38Z

scieval/dataset/Protein_GO/protein_go.py

+
+from ..text_base import TextBaseDataset
+from .utils.protein_go import evaluate_protein_go
+from ...smp import *


Avoid using wildcard imports (from ...smp import *). They can lead to namespace pollution and make the code harder to read and maintain. It's better to explicitly import the names you need. In this file, you are using osp, LMUDataRoot, and load.

Suggested change

from ...smp import *

from ...smp import LMUDataRoot, load, osp

gemini-code-assist · 2026-03-03T10:01:38Z

scieval/dataset/Protein_GO/utils/protein_go.py

+import re
+from typing import List, Set, Tuple
+
+from ....smp import *


Avoid using wildcard imports (from ....smp import *). They can lead to namespace pollution and make the code harder to read and maintain. It's better to explicitly import the names you need. In this file, you appear to be using dump, load, and pd from this import.

Suggested change

from ....smp import *

from ....smp import dump, load, pd

gemini-code-assist · 2026-03-03T10:01:38Z

scieval/dataset/Protein_GO/utils/protein_go.py

+from ....smp import *
+from ....smp.file import get_intermediate_file_path
+
+GO_ID_PATTERN = re.compile(r"GO:\d{7}")


The compiled regex GO_ID_PATTERN is defined but never used in the file. It should be removed to avoid dead code and improve maintainability.

gemini-code-assist bot reviewed Mar 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Protein GO annotation task for scientific evaluation#10

Add Protein GO annotation task for scientific evaluation#10
suencgo wants to merge 1 commit intoInternScience:mainfrom
suencgo:feature/protein-go

suencgo commented Mar 3, 2026

Uh oh!

gemini-code-assist bot commented Mar 3, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 3, 2026

Uh oh!

gemini-code-assist bot Mar 3, 2026

Uh oh!

gemini-code-assist bot Mar 3, 2026

Uh oh!

gemini-code-assist bot Mar 3, 2026

Uh oh!

gemini-code-assist bot Mar 3, 2026

Uh oh!

gemini-code-assist bot Mar 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-    data = load(eval_file)
+    if not osp.exists(eval_file):
+        raise FileNotFoundError(f"Eval file {eval_file} not found.")
+    if not any(eval_file.endswith(ext) for ext in ['.csv', '.tsv', '.json', '.xlsx']):
+        raise ValueError(f"Unsupported file format for {eval_file}. Please use .csv, .tsv, .json, or .xlsx")
+    data = load(eval_file)

	from ...smp import *
	from ...smp import LMUDataRoot, load, osp

Conversation

suencgo commented Mar 3, 2026

Uh oh!

gemini-code-assist bot commented Mar 3, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant