Add Protein GO annotation task for scientific evaluation #10

Open
suencgo wants to merge 1 commit into InternScience:main from suencgo:feature/protein-go

@suencgo suencgo commented Mar 3, 2026

  • Add ProteinGODataset supporting go_bp, go_cc, go_mf (Gene Ontology BP/CC/MF)
  • Add evaluation utilities with multi-format parsing (GPT, Qwen, etc.)
  • Add GO term mapping files for BP, CC, MF categories
  • Register ProteinGODataset in TEXT_DATASET for run.py integration

Made-with: Cursor

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new ProteinGODataset into SciEvalKit for evaluating models that predict protein Gene Ontology (GO) annotations. It provides structured data loading for the three GO categories (BP, CC, MF) and includes evaluation utilities that parse predictions from different language models and compare them against the reference annotations.

Highlights

  • New Dataset Integration: Added ProteinGODataset to support Gene Ontology (GO) annotation prediction across Biological Process (BP), Cellular Component (CC), and Molecular Function (MF) categories.
  • Enhanced Evaluation Utilities: Introduced robust evaluation utilities capable of parsing diverse model output formats (e.g., GPT, Qwen) for accurate GO term extraction and comparison.
  • GO Term Mappings: Included comprehensive mapping files for GO term IDs specific to the BP, CC, and MF branches.
  • System Integration: Registered the newly added ProteinGODataset within the TEXT_DATASET list, ensuring its seamless integration and availability for run.py.
Changelog
  • scieval/dataset/Protein_GO/__init__.py
    • Initialized the Protein_GO package, exposing the ProteinGODataset.
  • scieval/dataset/Protein_GO/mappings/go_bp_mapping.txt
    • Added GO term ID mappings for the Biological Process branch.
  • scieval/dataset/Protein_GO/mappings/go_cc_mapping.txt
    • Added GO term ID mappings for the Cellular Component branch.
  • scieval/dataset/Protein_GO/mappings/go_mf_mapping.txt
    • Added GO term ID mappings for the Molecular Function branch.
  • scieval/dataset/Protein_GO/protein_go.py
    • Implemented the ProteinGODataset class, including data loading and exact match evaluation logic for GO annotation prediction.
  • scieval/dataset/Protein_GO/utils/__init__.py
    • Initialized the utils subpackage, exposing GO evaluation functions.
  • scieval/dataset/Protein_GO/utils/protein_go.py
    • Developed utilities for cleaning, parsing, extracting, normalizing, and evaluating GO term predictions from various model outputs.
  • scieval/dataset/__init__.py
    • Updated dataset imports to include ProteinGODataset.
    • Registered ProteinGODataset in the TEXT_DATASET list for broader system access.
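
The exact-match evaluation mentioned in the changelog entry for protein_go.py could look roughly like the sketch below. This is an illustration only, not the code from this pull request: the names `GO_ID`, `extract_go_ids`, and `exact_match_accuracy` are hypothetical, and it assumes that predictions and answers both contain GO IDs written as `GO:` followed by seven digits. It reports the same three fields as the PR's result table (Total, Correct, Accuracy in percent).

```python
import re
from typing import Dict, List, Set, Tuple

# Assumption: GO terms are identified by IDs such as "GO:0005515".
GO_ID = re.compile(r"GO:\d{7}")


def extract_go_ids(text: str) -> Set[str]:
    """Collect the distinct GO IDs mentioned anywhere in a string."""
    return set(GO_ID.findall(text))


def exact_match_accuracy(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score (prediction, answer) pairs.

    A pair counts as correct only when the two ID sets are identical,
    regardless of order or surrounding text.
    """
    total = len(pairs)
    correct = sum(
        1 for pred, ans in pairs
        if extract_go_ids(pred) == extract_go_ids(ans)
    )
    acc = correct / total if total else 0.0
    return {"Total": total, "Correct": correct, "Accuracy": acc * 100.0}


if __name__ == "__main__":
    samples = [
        # Same two IDs in a different order: counted as correct.
        ("The protein has GO:0005515 and GO:0003677.", "GO:0003677; GO:0005515"),
        # One ID missing from the prediction: counted as incorrect.
        ("Probably GO:0005634.", "GO:0005634; GO:0005737"),
    ]
    print(exact_match_accuracy(samples))
```

Note that in this sketch a pair where neither side contains any GO ID also compares equal; a real evaluator would likely want to treat an empty answer as an error instead.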
Activity
  • No human activity has been recorded on this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new ProteinGODataset for evaluating protein Gene Ontology annotation prediction, including its dataset class, evaluation utilities, and GO term mapping files. However, significant security concerns exist regarding the handling of evaluation input files. The use of a generic load function on untrusted file paths could lead to Remote Code Execution (RCE) via insecure deserialization of pickle files and Server-Side Request Forgery (SSRF) via URL-based loading. Additionally, a lack of path sanitization when writing results may allow for arbitrary file writes. Strict validation on input file paths and formats is crucial to mitigate these risks. Beyond security, there are also areas for improvement in utility functions concerning Python best practices, such as avoiding wildcard imports, removing unused code, and using more specific exception handling.

- Parse prediction to GO terms, compare with answer
- Return DataFrame: ['Total', 'Correct', 'Accuracy']
"""
data = load(eval_file)

security-critical critical

The evaluate_protein_go function passes the eval_file argument directly to the load utility. The load function (defined in scieval/smp/file.py) is dangerous because it automatically downloads files from URLs (leading to SSRF) and uses pickle.load for files with a .pkl extension (leading to Remote Code Execution). An attacker could provide a malicious URL or a path to a crafted pickle file to compromise the system.

Recommendation: Validate that eval_file is a local file path and that its extension is one of the expected safe formats (e.g., .csv, .tsv, .json, .xlsx). Avoid using load on untrusted paths if it supports unsafe deserialization.

Suggested change
- data = load(eval_file)
+ if not osp.exists(eval_file):
+     raise FileNotFoundError(f"Eval file {eval_file} not found.")
+ if not any(eval_file.endswith(ext) for ext in ['.csv', '.tsv', '.json', '.xlsx']):
+     raise ValueError(f"Unsupported file format for {eval_file}. Please use .csv, .tsv, .json, or .xlsx")
+ data = load(eval_file)
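
The same checks can be collected into a small helper that runs before `load` is called. The sketch below is one possible shape under stated assumptions, not code from this pull request: `validate_eval_file` and `SAFE_EXTENSIONS` are hypothetical names, the allowlist simply repeats the formats listed in the recommendation, and treating any string that contains `://` as a URL is a deliberately coarse rule.

```python
import os.path as osp

# Assumption: these are the only formats the evaluator needs to read.
SAFE_EXTENSIONS = (".csv", ".tsv", ".json", ".xlsx")


def validate_eval_file(eval_file: str) -> str:
    """Return eval_file unchanged if it is a local file with an allowed extension.

    Raises ValueError for URLs and unsupported formats, and
    FileNotFoundError if the path does not point to an existing file.
    """
    # Refuse anything that looks like a URL so nothing is downloaded.
    if "://" in eval_file:
        raise ValueError(f"Remote paths are not allowed: {eval_file}")

    # Compare the extension case-insensitively against the allowlist;
    # this rejects .pkl and any other format not named above.
    ext = osp.splitext(eval_file)[1].lower()
    if ext not in SAFE_EXTENSIONS:
        raise ValueError(
            f"Unsupported file format {ext!r} for {eval_file}; "
            f"expected one of {', '.join(SAFE_EXTENSIONS)}"
        )

    # isfile() is stricter than exists(): directories are rejected too.
    if not osp.isfile(eval_file):
        raise FileNotFoundError(f"Eval file {eval_file} not found.")

    return eval_file
```

With such a helper in place, the call site would become `data = load(validate_eval_file(eval_file))`. Checking the extension before touching the filesystem means a disallowed format is rejected whether or not the file exists.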

Comment on lines +67 to +83
def _try_extract_json_array(text: str) -> List[str] | None:
    try:
        import json
        m = re.search(r'\[[\s\S]*?\]', text)
        if m:
            arr = json.loads(m.group())
            if isinstance(arr, list) and all(isinstance(x, str) for x in arr):
                return [x.strip() for x in arr if x.strip()]
        m = re.search(r'\{[\s\S]*?\}', text)
        if m:
            obj = json.loads(m.group())
            for key in ("annotations", "terms", "go_terms", "answer"):
                if key in obj and isinstance(obj[key], list):
                    return [str(x).strip() for x in obj[key] if str(x).strip()]
    except Exception:
        pass
    return None

high

This function has a couple of areas for improvement:

  1. The import json statement on line 69 should be moved to the top of the file with other imports. This is a standard Python convention (PEP 8) and improves code readability.
  2. Catching a broad Exception on line 81 is risky as it can hide unexpected bugs. It's better to catch more specific exceptions. In this context, json.JSONDecodeError and TypeError would be more appropriate.
Suggested change
- def _try_extract_json_array(text: str) -> List[str] | None:
-     try:
-         import json
-         m = re.search(r'\[[\s\S]*?\]', text)
-         if m:
-             arr = json.loads(m.group())
-             if isinstance(arr, list) and all(isinstance(x, str) for x in arr):
-                 return [x.strip() for x in arr if x.strip()]
-         m = re.search(r'\{[\s\S]*?\}', text)
-         if m:
-             obj = json.loads(m.group())
-             for key in ("annotations", "terms", "go_terms", "answer"):
-                 if key in obj and isinstance(obj[key], list):
-                     return [str(x).strip() for x in obj[key] if str(x).strip()]
-     except Exception:
-         pass
-     return None
+ def _try_extract_json_array(text: str) -> List[str] | None:
+     try:
+         m = re.search(r'\[[\s\S]*?\]', text)
+         if m:
+             arr = json.loads(m.group())
+             if isinstance(arr, list) and all(isinstance(x, str) for x in arr):
+                 return [x.strip() for x in arr if x.strip()]
+         m = re.search(r'\{[\s\S]*?\}', text)
+         if m:
+             obj = json.loads(m.group())
+             for key in ("annotations", "terms", "go_terms", "answer"):
+                 if key in obj and isinstance(obj[key], list):
+                     return [str(x).strip() for x in obj[key] if str(x).strip()]
+     except (json.JSONDecodeError, TypeError):
+         pass
+     return None

Comment on lines +177 to +179
res = pd.DataFrame({"Total": [total], "Correct": [correct], "Accuracy": [acc * 100.0]})
score_file = get_intermediate_file_path(eval_file, "_acc", "csv")
dump(res, score_file)

security-medium medium

The evaluate_protein_go function constructs a result file path (score_file) by appending a suffix to the user-provided eval_file path and then writes to it using dump. Since eval_file is not sanitized, an attacker can use path traversal sequences (e.g., ../../) or absolute paths to cause the application to write evaluation results to arbitrary locations on the filesystem.

Recommendation: Sanitize the eval_file path to ensure it stays within an expected directory, or use only the basename of the file when constructing the output path.

Suggested change
- res = pd.DataFrame({"Total": [total], "Correct": [correct], "Accuracy": [acc * 100.0]})
- score_file = get_intermediate_file_path(eval_file, "_acc", "csv")
- dump(res, score_file)
+ res = pd.DataFrame({"Total": [total], "Correct": [correct], "Accuracy": [acc * 100.0]})
+ # Use basename to prevent path traversal when writing results
+ output_filename = get_intermediate_file_path(osp.basename(eval_file), "_acc", "csv")
+ score_file = osp.join(osp.dirname(eval_file), output_filename)
+ dump(res, score_file)
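
The other option named in the recommendation, keeping the output inside an expected directory, might be sketched as follows. This is a hedged illustration rather than part of the pull request: `safe_score_path` and its `work_dir` parameter are hypothetical, and the `_acc.csv` suffix is only an assumption about the file name that `get_intermediate_file_path` produces.

```python
import os.path as osp


def safe_score_path(eval_file: str, work_dir: str, suffix: str = "_acc.csv") -> str:
    """Build the score-file path inside work_dir, refusing anything that escapes it."""
    base = osp.realpath(work_dir)

    # Keep only the file name, so "../" segments and absolute
    # directories in eval_file are ignored.
    stem = osp.splitext(osp.basename(eval_file))[0]
    if not stem:
        raise ValueError(f"Cannot derive a file name from {eval_file!r}")

    # Resolve symlinks, then confirm the result still lies under base.
    candidate = osp.realpath(osp.join(base, stem + suffix))
    if osp.commonpath([base, candidate]) != base:
        raise ValueError(f"Refusing to write outside {base}: {candidate}")

    return candidate


if __name__ == "__main__":
    # The traversal segments are dropped; the file lands in ./outputs.
    print(safe_score_path("../../etc/preds.tsv", "outputs"))
```

The difference from using the basename alone is that the directory comes from a value the application controls (`work_dir`) instead of from the directory part of `eval_file`.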


from ..text_base import TextBaseDataset
from .utils.protein_go import evaluate_protein_go
from ...smp import *

medium

Avoid using wildcard imports (from ...smp import *). They can lead to namespace pollution and make the code harder to read and maintain. It's better to explicitly import the names you need. In this file, you are using osp, LMUDataRoot, and load.

Suggested change
- from ...smp import *
+ from ...smp import LMUDataRoot, load, osp

import re
from typing import List, Set, Tuple

from ....smp import *

medium

Avoid using wildcard imports (from ....smp import *). They can lead to namespace pollution and make the code harder to read and maintain. It's better to explicitly import the names you need. In this file, you appear to be using dump, load, and pd from this import.

Suggested change
- from ....smp import *
+ from ....smp import dump, load, pd
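
The namespace-pollution concern behind both wildcard-import comments can be reproduced in a few lines. The sketch below uses a throwaway module named `fake_smp`, which is purely hypothetical and stands in for any utility package that exposes many public names; it says nothing about what the real `smp` package exports.

```python
import re
import sys
import types

# Hypothetical stand-in for a utility package with many public names.
fake_smp = types.ModuleType("fake_smp")
fake_smp.load = lambda path: f"loaded {path}"
fake_smp.re = "not the regex module"  # unrelated public name that happens to be `re`
sys.modules["fake_smp"] = fake_smp

# Simulate a module that imports `re`, then performs a wildcard import.
wildcard_ns = {"re": re}
exec("from fake_smp import *", wildcard_ns)

# Simulate the same module with an explicit import instead.
explicit_ns = {"re": re}
exec("from fake_smp import load", explicit_ns)

# The wildcard import silently replaced `re`; the explicit one did not.
print(wildcard_ns["re"] is re)   # False
print(explicit_ns["re"] is re)   # True
```

Because a wildcard import copies every public name from the source module, a later addition to that module can shadow a name in the importing file without any change to the importing file itself.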

from ....smp import *
from ....smp.file import get_intermediate_file_path

GO_ID_PATTERN = re.compile(r"GO:\d{7}")

medium

The compiled regex GO_ID_PATTERN is defined but never used in the file. It should be removed to avoid dead code and improve maintainability.
