Asset Detail:

ML Ready Pathology Reports

Overview
ASSET LINK: https://modac.cancer.gov/assetDetails?dme_data_id=NCI-DME-MS01-7423964
PROGRAM NAME: NCI-DOE Collaboration
STUDY NAME: NCI DOE Collaboration MOSSAIC project: Population Information Integration, Analysis, and Modeling for Precision Surveillance
ASSET NAME: ML Ready Pathology Reports
ASSET PATH: /NCI_DOE_Archive/JDACS4C/JDACS4C_Pilot_3/ml_ready_pathology_reports
Asset Attributes
ASSET NAME: ML Ready Pathology Reports
ASSET DESCRIPTION: This asset contains 7187 pathology reports with the associated site and histology labels downloaded from the Genomic Data Commons Platform at the National Cancer Institute. The files in ml_ready_raw_text_pathology_reports.tar.gz were converted from PDF to text using an optical character recognition program (refer to the Tesseract link). An example of a report is available on the GDC archive portal (refer to the GDC link). The file ml_ready_raw_text_histo_metadata.csv contains annotations (such as site and histology) extracted from those reports. This data set is used as input to MT-CNN and HiSan (refer to the GitHub Repository links and Model links).
ASSET IDENTIFIER: ml_ready_pathology_reports
ASSET TYPE: Dataset
PLATFORM VERSION: None
IS REFERENCE DATASET: No
COLLECTION SIZE: 11.9 MB
GDC: https://portal.gdc.cancer.gov/legacy-archive/files/a9a42650-4613-448d-895e-4f904285f508
GITHUB REPOSITORY HISAN: https://github.com/CBIIT/NCI-DOE-Collab-Pilot3-Pathology-Reports-Hierarchical-Self-Attention-Network
GITHUB REPOSITORY MT-CNN: https://github.com/CBIIT/NCI-DOE-Collab-Pilot3-Multitask-Convolutional_Neural_Network
MODEL HISAN: https://modac.cancer.gov/searchTab?dme_data_id=NCI-DME-MS01-7565752
MODEL MT-CNN: https://modac.cancer.gov/searchTab?dme_data_id=NCI-DME-MS01-7330732
TESSERACT: https://github.com/tesseract-ocr/
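
The asset description names two files: a tar.gz archive of OCR text reports and a CSV of annotations. The following is a minimal sketch of how they might be inspected once downloaded, using only the Python standard library. The two file names come from the description above. The archive's internal layout and the CSV's column names are not stated on this page, so the code discovers them at run time rather than assuming them. The function names are hypothetical helpers introduced for illustration.

```python
import csv
import tarfile
from pathlib import Path


def list_report_members(archive_path):
    """Return the names of regular files stored in the tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [m.name for m in archive.getmembers() if m.isfile()]


def read_report_text(archive_path, member_name, encoding="utf-8"):
    """Read one report from the archive without unpacking it to disk."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        handle = archive.extractfile(member_name)
        if handle is None:
            raise ValueError(f"{member_name} is not a regular file")
        # Assumption: reports are UTF-8 text. OCR output may contain stray
        # bytes, so undecodable bytes are replaced instead of raising.
        return handle.read().decode(encoding, errors="replace")


def read_metadata(csv_path):
    """Return (column_names, rows) from the metadata CSV; rows are dicts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


if __name__ == "__main__":
    # File names as given in the asset description; adjust the paths to
    # wherever the downloaded files were saved.
    archive = Path("ml_ready_raw_text_pathology_reports.tar.gz")
    metadata = Path("ml_ready_raw_text_histo_metadata.csv")
    if archive.exists() and metadata.exists():
        members = list_report_members(archive)
        columns, rows = read_metadata(metadata)
        print(f"{len(members)} report files, {len(rows)} metadata rows")
        print("metadata columns:", columns)
        if members:
            print(read_report_text(archive, members[0])[:300])
```

Printing the metadata columns first shows which fields hold the site and histology labels and which field, if any, links a row to a report file, before the data is passed to MT-CNN or HiSan.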
