Add Protein GO annotation task for scientific evaluation #10
suencgo wants to merge 1 commit into InternScience:main
Conversation
- Add ProteinGODataset supporting go_bp, go_cc, go_mf (Gene Ontology BP/CC/MF)
- Add evaluation utilities with multi-format parsing (GPT, Qwen, etc.)
- Add GO term mapping files for BP, CC, MF categories
- Register ProteinGODataset in TEXT_DATASET for run.py integration

Made-with: Cursor
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
Code Review
This pull request introduces a new ProteinGODataset for evaluating protein Gene Ontology annotation prediction, including its dataset class, evaluation utilities, and GO term mapping files. However, significant security concerns exist regarding the handling of evaluation input files. The use of a generic load function on untrusted file paths could lead to Remote Code Execution (RCE) via insecure deserialization of pickle files and Server-Side Request Forgery (SSRF) via URL-based loading. Additionally, a lack of path sanitization when writing results may allow for arbitrary file writes. Strict validation on input file paths and formats is crucial to mitigate these risks. Beyond security, there are also areas for improvement in utility functions concerning Python best practices, such as avoiding wildcard imports, removing unused code, and using more specific exception handling.
```python
    - Parse prediction to GO terms, compare with answer
    - Return DataFrame: ['Total', 'Correct', 'Accuracy']
    """
    data = load(eval_file)
```
The evaluate_protein_go function passes the eval_file argument directly to the load utility. The load function (defined in scieval/smp/file.py) is dangerous because it automatically downloads files from URLs (leading to SSRF) and uses pickle.load for files with a .pkl extension (leading to Remote Code Execution). An attacker could provide a malicious URL or a path to a crafted pickle file to compromise the system.
Recommendation: Validate that eval_file is a local file path and that its extension is one of the expected safe formats (e.g., .csv, .tsv, .json, .xlsx). Avoid using load on untrusted paths if it supports unsafe deserialization.
Suggested change:

```diff
-data = load(eval_file)
+if not osp.exists(eval_file):
+    raise FileNotFoundError(f"Eval file {eval_file} not found.")
+if not any(eval_file.endswith(ext) for ext in ['.csv', '.tsv', '.json', '.xlsx']):
+    raise ValueError(f"Unsupported file format for {eval_file}. Please use .csv, .tsv, .json, or .xlsx")
+data = load(eval_file)
```
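To make the RCE concern concrete, here is a minimal, self-contained sketch (not from this PR) of how a crafted pickle executes attacker-controlled code the moment it is deserialized. The `Payload` class and the harmless `eval("6 * 7")` call are illustrative stand-ins for a real exploit such as `os.system(...)`:

```python
# Illustrative sketch: why pickle.load on an untrusted file is
# remote code execution. Unpickling runs the callable returned by
# __reduce__ before any type check the caller might perform.
import pickle

class Payload:
    def __reduce__(self):
        # On unpickling, this invokes eval("6 * 7") -- a benign
        # stand-in for any attacker-chosen command.
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # attacker code runs here
print(result)  # 42
```

This is why restricting `load` to declarative formats like CSV/JSON/XLSX (as suggested above) matters: those parsers produce data, not executed code.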
```python
def _try_extract_json_array(text: str) -> List[str] | None:
    try:
        import json
        m = re.search(r'\[[\s\S]*?\]', text)
        if m:
            arr = json.loads(m.group())
            if isinstance(arr, list) and all(isinstance(x, str) for x in arr):
                return [x.strip() for x in arr if x.strip()]
        m = re.search(r'\{[\s\S]*?\}', text)
        if m:
            obj = json.loads(m.group())
            for key in ("annotations", "terms", "go_terms", "answer"):
                if key in obj and isinstance(obj[key], list):
                    return [str(x).strip() for x in obj[key] if str(x).strip()]
    except Exception:
        pass
    return None
```
This function has a couple of areas for improvement:

- The `import json` statement on line 69 should be moved to the top of the file with the other imports. This is a standard Python convention (PEP 8) and improves code readability.
- Catching a broad `Exception` on line 81 is risky, as it can hide unexpected bugs. It's better to catch more specific exceptions; in this context, `json.JSONDecodeError` and `TypeError` would be more appropriate.
Suggested change:

```diff
 def _try_extract_json_array(text: str) -> List[str] | None:
     try:
-        import json
         m = re.search(r'\[[\s\S]*?\]', text)
         if m:
             arr = json.loads(m.group())
             if isinstance(arr, list) and all(isinstance(x, str) for x in arr):
                 return [x.strip() for x in arr if x.strip()]
         m = re.search(r'\{[\s\S]*?\}', text)
         if m:
             obj = json.loads(m.group())
             for key in ("annotations", "terms", "go_terms", "answer"):
                 if key in obj and isinstance(obj[key], list):
                     return [str(x).strip() for x in obj[key] if str(x).strip()]
-    except Exception:
+    except (json.JSONDecodeError, TypeError):
         pass
     return None
```
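For reference, the helper's fallback chain (bare JSON array first, then a JSON object keyed by one of several well-known names) can be exercised in isolation. The sketch below is a self-contained copy with the narrower exception handling discussed above; the sample strings are hypothetical model outputs, not fixtures from this PR:

```python
import json
import re

def try_extract_json_array(text):
    """Return a list of non-empty strings parsed from the first JSON
    array (or recognized JSON object field) found in text, else None."""
    try:
        m = re.search(r'\[[\s\S]*?\]', text)
        if m:
            arr = json.loads(m.group())
            if isinstance(arr, list) and all(isinstance(x, str) for x in arr):
                return [x.strip() for x in arr if x.strip()]
        m = re.search(r'\{[\s\S]*?\}', text)
        if m:
            obj = json.loads(m.group())
            for key in ("annotations", "terms", "go_terms", "answer"):
                if key in obj and isinstance(obj[key], list):
                    return [str(x).strip() for x in obj[key] if str(x).strip()]
    except (json.JSONDecodeError, TypeError):
        pass
    return None

print(try_extract_json_array('Answer: ["GO:0005515", "GO:0003677"]'))
# → ['GO:0005515', 'GO:0003677']
print(try_extract_json_array("no json here"))
# → None
```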
```python
res = pd.DataFrame({"Total": [total], "Correct": [correct], "Accuracy": [acc * 100.0]})
score_file = get_intermediate_file_path(eval_file, "_acc", "csv")
dump(res, score_file)
```
The evaluate_protein_go function constructs a result file path (score_file) by appending a suffix to the user-provided eval_file path and then writes to it using dump. Since eval_file is not sanitized, an attacker can use path traversal sequences (e.g., ../../) or absolute paths to cause the application to write evaluation results to arbitrary locations on the filesystem.
Recommendation: Sanitize the eval_file path to ensure it stays within an expected directory, or use only the basename of the file when constructing the output path.
Suggested change:

```diff
 res = pd.DataFrame({"Total": [total], "Correct": [correct], "Accuracy": [acc * 100.0]})
-score_file = get_intermediate_file_path(eval_file, "_acc", "csv")
+# Use basename to prevent path traversal when writing results
+output_filename = get_intermediate_file_path(osp.basename(eval_file), "_acc", "csv")
+score_file = osp.join(osp.dirname(eval_file), output_filename)
 dump(res, score_file)
```
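A small sketch of the traversal issue under discussion. Here `naive_score_path` and `safe_score_path` are hypothetical stand-ins for the suffixing logic of `get_intermediate_file_path`, not the project's actual implementation; the point is only that taking `basename` discards caller-supplied directory components:

```python
import os.path as osp

def naive_score_path(eval_file):
    # Appends a suffix to whatever path the caller supplied --
    # "../../" sequences and absolute paths pass straight through.
    root, _ = osp.splitext(eval_file)
    return root + "_acc.csv"

def safe_score_path(eval_file, out_dir="outputs"):
    # basename() drops all directory components, pinning the output
    # file to a directory the application controls.
    name, _ = osp.splitext(osp.basename(eval_file))
    return osp.join(out_dir, name + "_acc.csv")

evil = "../../etc/results.xlsx"
print(naive_score_path(evil))  # escapes the working directory
print(safe_score_path(evil))   # stays inside out_dir
```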
```python
from ..text_base import TextBaseDataset
from .utils.protein_go import evaluate_protein_go
from ...smp import *
```
Avoid using wildcard imports (`from ...smp import *`). They can lead to namespace pollution and make the code harder to read and maintain. It's better to explicitly import the names you need. In this file, you are using `osp`, `LMUDataRoot`, and `load`.
Suggested change:

```diff
-from ...smp import *
+from ...smp import LMUDataRoot, load, osp
```
```python
import re
from typing import List, Set, Tuple

from ....smp import *
```
Avoid using wildcard imports (`from ....smp import *`). They can lead to namespace pollution and make the code harder to read and maintain. It's better to explicitly import the names you need. In this file, you appear to be using `dump`, `load`, and `pd` from this import.
Suggested change:

```diff
-from ....smp import *
+from ....smp import dump, load, pd
```
```python
from ....smp import *
from ....smp.file import get_intermediate_file_path

GO_ID_PATTERN = re.compile(r"GO:\d{7}")
```
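As a quick illustration (the sample sentence is made up, not from the PR's test data), this pattern pulls canonical seven-digit Gene Ontology identifiers out of free-form model output:

```python
import re

# Canonical GO identifiers: "GO:" followed by exactly seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")

text = "The protein localizes to GO:0005634 and binds DNA (GO:0003677)."
print(GO_ID_PATTERN.findall(text))  # → ['GO:0005634', 'GO:0003677']
```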