├── .gitattributes ├── curie_dist.png ├── extras ├── ICLR_2025_poster_CURIE.pdf └── CURIE_ICLR2025_deck_shared.pdf ├── data ├── data.zip ├── sha256sum.txt └── README.md ├── .github └── workflows │ └── generate_checksum.yml ├── prompts ├── georeference_image_0_shot.txt ├── reconstruct_protein_amino_acid_sequence_0_shot.txt ├── write_code_for_paper_0_shot.txt ├── extract_hamiltonian_0_shot.txt ├── mat_paper_to_passage_0_shot.txt ├── mat_passage_to_property_0_shot.txt ├── extract_dataset_from_geo_papers_0_shot.txt ├── mat_paper_to_property_0_shot.txt ├── mat_paper_to_property_1_shot_bandgap_refractive.txt ├── mat_paper_to_property_1_shot_exclude_trivia.txt ├── extract_structure_data_1_shot.txt ├── extract_dft_metadata_1_shot.txt ├── dft_metadata_eval_output_1_shot.txt ├── dft_structure_eval_output_1_shot.txt ├── describe_code_in_paper.txt ├── mat_eval_output_1_shot.txt └── mat_paper_to_property_1_shot.txt ├── docs ├── contributing.md └── code-of-conduct.md ├── README.md ├── LICENSE └── colabs ├── Example_PDB_long_llama_inference.ipynb ├── CURIEbenchmark_inference_Command_R_Plus.ipynb ├── curie_inference.ipynb └── curie_generate_tables_figures.ipynb /.gitattributes: -------------------------------------------------------------------------------- 1 | */*.zip filter=lfs diff=lfs merge=lfs -text 2 | -------------------------------------------------------------------------------- /curie_dist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/google/curie/HEAD/curie_dist.png -------------------------------------------------------------------------------- /extras/ICLR_2025_poster_CURIE.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/google/curie/HEAD/extras/ICLR_2025_poster_CURIE.pdf -------------------------------------------------------------------------------- /extras/CURIE_ICLR2025_deck_shared.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/google/curie/HEAD/extras/CURIE_ICLR2025_deck_shared.pdf -------------------------------------------------------------------------------- /data/data.zip: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:ab634136a5466412f13fbe6c86a6bc283ba1e760a49e07d24356407c844c9c01 3 | size 137211000 4 | -------------------------------------------------------------------------------- /data/sha256sum.txt: -------------------------------------------------------------------------------- 1 | # The SHA 256 sum generated using. 
2 | # sha256sum data/data.zip 3 | ab634136a5466412f13fbe6c86a6bc283ba1e760a49e07d24356407c844c9c01 data/data.zip 4 | 5 | -------------------------------------------------------------------------------- /.github/workflows/generate_checksum.yml: -------------------------------------------------------------------------------- 1 | name: Generate Zip Checksums 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | 7 | jobs: 8 | checksum: 9 | name: Generate Checksums for Zip Files 10 | runs-on: ubuntu-latest 11 | steps: 12 | - name: Checkout code 13 | uses: actions/checkout@v4 14 | 15 | - name: Generate checksum for zip files 16 | uses: jmgilman/actions-generate-checksum@v1 17 | with: 18 | patterns: | 19 | data/*.zip 20 | 21 | # Optional: Upload the checksum file as an artifact 22 | - name: Upload checksum artifact 23 | uses: actions/upload-artifact@v4 24 | with: 25 | name: checksums 26 | path: checksum.txt 27 | 28 | -------------------------------------------------------------------------------- /prompts/georeference_image_0_shot.txt: -------------------------------------------------------------------------------- 1 | For the following figure and caption, please return the WGS84 latitude and longitude bounding box coordinates of the map in the image. 2 | If there are multiple maps of the same region, please just return only one answer. 3 | If two areas are represented, and one is an inset of the other, return the smaller of the two areas. 4 | If you are not sure, please guess at an answer anyway. I'd rather have an answer than no answer at all. 5 | Make sure to return decimal coordinates in range [-90, 90] for latitude and (-180, 180] for longitude. 6 | Please put your answer in the following JSON format: 7 | { 8 | "W": , 9 | "S": , 10 | "E": , 11 | "N": 12 | } 13 | Please return only the JSON output. 14 | Here is the image and caption. 15 | {{text}} 16 | -------------------------------------------------------------------------------- /prompts/reconstruct_protein_amino_acid_sequence_0_shot.txt: -------------------------------------------------------------------------------- 1 | You are a computational biologist and I want you to reconstruct a protein's amino acid sequence from its tertiary structure. 2 | * The input is a PDB that is a textual format describing the three-dimensional structures of a protein. 3 | * Return the amino acid sequence in the standard FASTA format, which starts with a definition line with the greater than (>) line, 4 | followed by the single-letter codes for all amino acids in the second line. 5 | * Make sure the amino acid sequence is in the second line. 6 | * If there is an unknown amino acid in the structure, put "X" in the sequence. 7 | * Make sure you go through the whole structure and get all the amino acids. 8 | * No extra explanation is needed. 9 | 10 | below are the tertiary structure: 11 | 12 | {{text}} 13 | 14 | -------------------------------------------------------------------------------- /prompts/write_code_for_paper_0_shot.txt: -------------------------------------------------------------------------------- 1 | I am a computational materials scientist, and I would like to reproduce the DFT calculations from a paper. \ 2 | Write python code to reproduce all DFT calculations from a paper. \ 3 | You have access to the library ase, and any DFT software used in the paper. \ 4 | Whenever possible, make the code modular by write functions for each step of the calculation \ 5 | (i.e. 
set up the unit cell, run DFT, find the total energy from the DFT calculation, plot the band structure, ...) \ 6 | and calling the functions, instead of writing one long body of code. \ 7 | Make sure all input structures, DFT calculations, and calculation outputs such as energies, density of states or band gap are accounted for. 8 | --- 9 | This is the paper I'd like to write the python code for: 10 | {{text}} 11 | --- 12 | Output: 13 | -------------------------------------------------------------------------------- /prompts/extract_hamiltonian_0_shot.txt: -------------------------------------------------------------------------------- 1 | You are a physicist. 2 | You are reading a paper that has explicit equation (or equations) 3 | for the general Hartree-Fock or mean-field Hamiltonian. 4 | There might be several Hartree-Fock Hamiltonians in the paper. 5 | You should return the one that is the most general. 6 | Return this Hamiltonian. 7 | Print out each equation explicitly instead of citing it. 8 | Print out terms in the Hamiltonian if they are present in the paper, 9 | including integrals. 10 | Do not explain the Hamiltonian or the terms. 11 | Return the Hamiltonian in the following format: 12 | 'The general Hartree-Fock Hamiltonian is 13 | {{The Hartree-Fock or mean-field Hamiltonian}} 14 | where {{include all terms in the Hamiltonian}}' 15 | Be concise. Do not explain constants. 16 | \n 17 | PAPER: \n 18 | {{text}} 19 | \n 20 | YOUR RESPONSE: 21 | -------------------------------------------------------------------------------- /prompts/mat_paper_to_passage_0_shot.txt: -------------------------------------------------------------------------------- 1 | You are a materials scientist. Your goal is to find and extract all passages with numeric values or numeric value ranges of material properties mentioned or tabulated in a given paper text. 2 | There are some certain rules you need to follow: 3 | 1. Be thorough! Don't miss even a single passage that contains a numeric value. 4 | 2. Never record a passage that only has empty/null value for all properties in it. 5 | 6 | This below is the excerpt from a paper in latex format published in a materials science journal. 7 | Excerpt: 8 | text: 9 | {{text}} 10 | ------------------------------ 11 | With the above excerpt, the goal is to extract passages that contain numeric values or numeric value ranges of material properties from the above paper. 12 | The output should be in json format: 13 | ["passage_1", "passage_2", "passage_3"] 14 | Please only output your answer in the exact format as shown above. 15 | Output: 16 | -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- 1 | 2 | Data accompanying the paper 3 | 4 | **CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning** 5 | 6 | ## 📁 Repository Structure 7 | 8 | The contents of this repository are structured as follows: 9 | 10 | ```bash 11 | data 12 | ├── domain 13 | ├── inputs 14 | │   └── record_id.json 15 | └── ground_truth 16 | └── record_id.json 17 | └── difficulty_levels.json 18 | 19 | ``` 20 | The data folder consists of a single folder for each domain. Under each domain we 21 | have the `inputs` and `ground_truth` directories that contain the data for each 22 | single data point. 23 | 24 | The `difficulty_levels.json` contains the difficulty level (`easy`, `medium`, 25 | `hard`) values for each record for each task. 
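For illustration, here is a minimal Python sketch (not part of the released tooling) for loading a single record under this layout. The `domain` and `record_id` values are placeholders, and the location of `difficulty_levels.json` is assumed to be inside each domain folder; adjust the paths to match your unzipped copy.

```python
import json
from pathlib import Path

# Placeholder values -- substitute a real domain folder and record id.
data_root = Path("data")
domain = "dft"
record_id = "example_record"

# Each record has a matching input file and ground-truth file.
with open(data_root / domain / "inputs" / f"{record_id}.json") as f:
    record_input = json.load(f)   # contains "record_id" and "text" fields

with open(data_root / domain / "ground_truth" / f"{record_id}.json") as f:
    ground_truth = json.load(f)   # structure varies by domain

# Assumed location: difficulty levels stored alongside inputs/ and ground_truth/.
with open(data_root / domain / "difficulty_levels.json") as f:
    difficulty_levels = json.load(f)

print(record_input["record_id"], len(record_input["text"]), type(ground_truth))
```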
26 | 27 | 28 | ## 🧪 Running notebooks on the data 29 | 30 | To run the accompanying notebooks on the dataset, we recommend unzipping the 31 | data folder and uploading it to Google Drive and adding the path to the Folder 32 | in the notebooks. 33 | 34 | -------------------------------------------------------------------------------- /docs/contributing.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | We would love to accept your patches and contributions to this project. 4 | 5 | ## Before you begin 6 | 7 | ### Sign our Contributor License Agreement 8 | 9 | Contributions to this project must be accompanied by a 10 | [Contributor License Agreement](https://cla.developers.google.com/about) (CLA). 11 | You (or your employer) retain the copyright to your contribution; this simply 12 | gives us permission to use and redistribute your contributions as part of the 13 | project. 14 | 15 | If you or your current employer have already signed the Google CLA (even if it 16 | was for a different project), you probably don't need to do it again. 17 | 18 | Visit to see your current agreements or to 19 | sign a new one. 20 | 21 | ### Review our Community Guidelines 22 | 23 | This project follows [Google's Open Source Community 24 | Guidelines](https://opensource.google/conduct/). 25 | 26 | ## Contribution process 27 | 28 | ### Code Reviews 29 | 30 | All submissions, including submissions by project members, require review. We 31 | use [GitHub pull requests](https://docs.github.com/articles/about-pull-requests) 32 | for this purpose. 33 | -------------------------------------------------------------------------------- /prompts/mat_passage_to_property_0_shot.txt: -------------------------------------------------------------------------------- 1 | You are a materials scientist. Your goal is to find and extract all numeric values or numeric value ranges of material properties mentioned or tabulated in a given paper text. 2 | There are some certain rules you need to follow: 3 | 1. Be thorough! Don't miss even a single numeric value. 4 | 2. If one passage contains multiple materials or properties, record all of them as separate entries. 5 | 3. If different values are mentioned for the same material property, record all of them as separate entries. 6 | 4. Never record a property with empty/null value. 7 | 8 | This below is a list of passages that contains the numeric values of material properties from the the below paper excerpt. 9 | list: '{ passages }' 10 | 11 | And this below is the excerpt from the paper in latex format published in a materials science journal. 12 | Excerpt: 13 | text: {{text}} 14 | ------------------------------ 15 | With the above list and paper text excerpt, the goal is to extract numeric values or numeric value ranges of material properties from the above passage in the paper. 
16 | The output should be in json format: 17 | [ 18 | { 19 | "name": "material_1", 20 | "properties": [ 21 | { 22 | "name": "property_1", 23 | "low_value": "1.0", 24 | "high_value": "1.0", 25 | "units": "unit_1", 26 | "source_passage": "" 27 | }, 28 | { 29 | "name": "property_2", 30 | "low_value": "3.0", 31 | "high_value": "3.0", 32 | "units": "unit_2", 33 | "source_passage": "" 34 | } 35 | ] 36 | }, 37 | { 38 | "name": "material_2", 39 | "properties": [ 40 | { 41 | "name": "property_3", 42 | "low_value": "1.7", 43 | "high_value": "1.9", 44 | "units": "unit_3", 45 | "source_passage": "" 46 | } 47 | ] 48 | } 49 | ] 50 | Please only output your answer in the exact format as shown above. 51 | -------------------------------------------------------------------------------- /prompts/extract_dataset_from_geo_papers_0_shot.txt: -------------------------------------------------------------------------------- 1 | Given the paper, please gather the following information and put it in a JSON format. 2 | Here is the JSON format: 3 | 4 | { 5 | "paper_title": , 6 | "paper_link": , 7 | "datasets": [{ 8 | "dataset_name": , 9 | "dataset_website_or_source": [], 10 | "variables": [{ 11 | "variable_name": , 12 | "description": , 13 | "time_range": { 14 | "start_date": , 15 | "end_date": 16 | }, 17 | "spatial_range": 18 | },], 19 | }], 20 | "notes": 21 | } 22 | 23 | 24 | 25 | is the paper title. 26 | is the paper link. 27 | For ALL datasets used in the paper: 28 | is the name of dataset. 29 | is a link to the dataset. 30 | is the list of variables used in dataset. Be thorough and descriptive. 31 | is a list of all the names of variables. 32 | is the description of variable. 33 | * Example: if data includes all tweets that include a set of keywords between two dates, explain this with enough detail that the dataset could be reproduced if we had access to the raw tweets. For example if the dataset is based on google trends, you need to include search terms or categories used 34 | * FORMAT: VARIABLE_1, VARIABLE_2, VARIABLE_3, 35 | is a list of all time ranges of the variables. 36 | is the start date of the time range (format: yy-mm-dd). 37 | is the end date of the time range (format: yy-mmd-dd). 38 | * Note: Make sure to give all the time ranges in a list format. 39 | 40 | is a list of all spatial ranges of variables. 41 | * Note: Make sure to specify all locations with enough detail that the dataset can be exactly reproduced. So if a dataset includes data from 196 counties in the US, please give explicitly the names of all of the counties. 42 | * Note: Make sure to give the names of all the locations in a list format. 43 | * Example: County vs State vs Country vs Census Block. 44 | add a note if you have issues filling out any of the fields. 45 | Please copy directly the text of the paper. 46 | Please be concise but have enough detail to reproduce. 47 | Make sure you generate the JSON. 48 | 49 | Here is the paper: 50 | {{text}} 51 | 52 | -------------------------------------------------------------------------------- /prompts/mat_paper_to_property_0_shot.txt: -------------------------------------------------------------------------------- 1 | You are a materials scientist. Your goal is to find and extract all numeric values or numeric value ranges of material properties mentioned or tabulated in a given paper text. 
2 | 3 | The output should be in json format: 4 | [ 5 | { 6 | "material": "", 7 | "material_descriptor": "", 8 | "material_source_passage": "", 9 | "material_source_table": "", 10 | "property_name": "", 11 | "property_descriptor": "", 12 | "property_source_passage": "", 13 | "property_source_table": "", 14 | "low_value" : "", 15 | "high_value": "", 16 | "value_units": "", 17 | "value_source_passage": "", 18 | "value_source_table": "", 19 | }, ... 20 | ] 21 | 22 | An example of the output is as follows: 23 | [ 24 | { 25 | "material":"HfO2<\/sub>", 26 | "material_descriptor":"films\natomic layer deposition\nThickness 9.80 nm", 27 | "material_source_passage":"Figure 2a shows the refractive index n as a function of photon energy for all samples as deduced from the analysis of the SE results. The n value for HfO2<\/sub> measured at 550-nm (2.26 eV) wavelength is 1.84, which is similar with the previous report [20].", 28 | "material_source_table":"1", 29 | "property_name":"refractive index\nn\nindex of refraction", 30 | "property_descriptor":"Wavelength 550 nm\nphoton energy 2.26 eV\nGaussian dispersion model", 31 | "property_source_passage":"Figure 2a shows the refractive index n as a function of photon energy for all samples as deduced from the analysis of the SE results. The n value for HfO2<\/sub> measured at 550-nm (2.26 eV) wavelength is 1.84, which is similar with the previous report [20].", 32 | "property_source_table":"", 33 | "low_value":"1.84", 34 | "high_value":"1.84", 35 | "value_units":"", 36 | "value_source_passage":"Figure 2a shows the refractive index n as a function of photon energy for all samples as deduced from the analysis of the SE results. The n value for HfO2<\/sub> measured at 550-nm (2.26 eV) wavelength is 1.84, which is similar with the previous report [20].", 37 | "value_source_table":"", 38 | }, ... 39 | ] 40 | 41 | There are some certain rules you need to follow: 42 | 1. Be thorough! Don't miss even a single numeric value. 43 | 2. Avoid null low_value and high_value. Simply skip the entry if you cannot extract the numeric value. 44 | 3. If a value is only extractable from a figure, not text or table, simply skip that entry. 45 | 4. Make sure whatever low_value and high_value actually exists in the sourcepassage or refered table. 46 | 5. If one passage contains multiple materials or properties, record all of them as separate entries. 47 | 6. If different values are mentioned for the same material property, record all of them as separate entries. 48 | 7. Material, property, and value may or may not be mentioned in the same passage. 49 | 50 | This below is the excerpt from a paper in latex format published in a materials science journal. 51 | Excerpt: 52 | text: 53 | {{text}} 54 | ------------------------------ 55 | With the above excerpt, the goal is to extract numeric values or numeric value ranges of material properties from the above paper. 56 | 57 | Please only output your answer in the exact format as shown above without any prefix. 58 | Output: 59 | -------------------------------------------------------------------------------- /prompts/mat_paper_to_property_1_shot_bandgap_refractive.txt: -------------------------------------------------------------------------------- 1 | You are a materials scientist. Your goal is to find and extract all numeric values or numeric value ranges of material properties mentioned or tabulated in a given paper text. 
2 | 3 | The output should be in json format: 4 | [ 5 | { 6 | "material": "", 7 | "material_descriptor": "", 8 | "material_source_passage": "", 9 | "material_source_table": "", 10 | "property_name": "", 11 | "property_descriptor": "", 12 | "property_source_passage": "", 13 | "property_source_table": "", 14 | "low_value" : "", 15 | "high_value": "", 16 | "value_units": "", 17 | "value_source_passage": "", 18 | "value_source_table": "", 19 | }, ... 20 | ] 21 | 22 | An example of the output is as follows: 23 | [ 24 | { 25 | "material":"HfO2<\/sub>", 26 | "material_descriptor":"films\natomic layer deposition\nThickness 9.80 nm", 27 | "material_source_passage":"Figure 2a shows the refractive index n as a function of photon energy for all samples as deduced from the analysis of the SE results. The n value for HfO2<\/sub> measured at 550-nm (2.26 eV) wavelength is 1.84, which is similar with the previous report [20].", 28 | "material_source_table":"1", 29 | "property_name":"refractive index\nn\nindex of refraction", 30 | "property_descriptor":"Wavelength 550 nm\nphoton energy 2.26 eV\nGaussian dispersion model", 31 | "property_source_passage":"Figure 2a shows the refractive index n as a function of photon energy for all samples as deduced from the analysis of the SE results. The n value for HfO2<\/sub> measured at 550-nm (2.26 eV) wavelength is 1.84, which is similar with the previous report [20].", 32 | "property_source_table":"", 33 | "low_value":"1.84", 34 | "high_value":"1.84", 35 | "value_units":"", 36 | "value_source_passage":"Figure 2a shows the refractive index n as a function of photon energy for all samples as deduced from the analysis of the SE results. The n value for HfO2<\/sub> measured at 550-nm (2.26 eV) wavelength is 1.84, which is similar with the previous report [20].", 37 | "value_source_table":"", 38 | }, ... 39 | ] 40 | 41 | There are some certain rules you need to follow: 42 | 1. Be thorough! Don't miss even a single numeric value. 43 | 2. Avoid null low_value and high_value. Simply skip the entry if you cannot extract the numeric value. 44 | 3. If a value is only extractable from a figure, not text or table, simply skip that entry. 45 | 4. Make sure whatever low_value and high_value actually exists in the sourcepassage or refered table. 46 | 5. If one passage contains multiple materials or properties, record all of them as separate entries. 47 | 6. If different values are mentioned for the same material property, record all of them as separate entries. 48 | 7. Material, property, and value may or may not be mentioned in the same passage. 49 | 8. You only need to focus on two properties: bandgap, and refractive index. Be flexible on synonyms. 50 | 51 | This below is the excerpt from a paper in latex format published in a materials science journal. 52 | Excerpt: 53 | text: 54 | {{text}} 55 | ------------------------------ 56 | With the above excerpt, the goal is to extract numeric values or numeric value ranges of material properties from the above paper. 57 | 58 | Please only output your answer in the exact format as shown above without any prefix. 59 | Output: 60 | -------------------------------------------------------------------------------- /prompts/mat_paper_to_property_1_shot_exclude_trivia.txt: -------------------------------------------------------------------------------- 1 | You are a materials scientist. Your goal is to find and extract all numeric values or numeric value ranges of material properties mentioned or tabulated in a given paper text. 
2 | 3 | The output should be in json format: 4 | [ 5 | { 6 | "material": "", 7 | "material_descriptor": "", 8 | "material_source_passage": "", 9 | "material_source_table": "", 10 | "property_name": "", 11 | "property_descriptor": "", 12 | "property_source_passage": "", 13 | "property_source_table": "", 14 | "low_value" : "", 15 | "high_value": "", 16 | "value_units": "", 17 | "value_source_passage": "", 18 | "value_source_table": "", 19 | }, ... 20 | ] 21 | 22 | An example of the output is as follows: 23 | [ 24 | { 25 | "material":"HfO2<\/sub>", 26 | "material_descriptor":"films\natomic layer deposition\nThickness 9.80 nm", 27 | "material_source_passage":"Figure 2a shows the refractive index n as a function of photon energy for all samples as deduced from the analysis of the SE results. The n value for HfO2<\/sub> measured at 550-nm (2.26 eV) wavelength is 1.84, which is similar with the previous report [20].", 28 | "material_source_table":"1", 29 | "property_name":"refractive index\nn\nindex of refraction", 30 | "property_descriptor":"Wavelength 550 nm\nphoton energy 2.26 eV\nGaussian dispersion model", 31 | "property_source_passage":"Figure 2a shows the refractive index n as a function of photon energy for all samples as deduced from the analysis of the SE results. The n value for HfO2<\/sub> measured at 550-nm (2.26 eV) wavelength is 1.84, which is similar with the previous report [20].", 32 | "property_source_table":"", 33 | "low_value":"1.84", 34 | "high_value":"1.84", 35 | "value_units":"", 36 | "value_source_passage":"Figure 2a shows the refractive index n as a function of photon energy for all samples as deduced from the analysis of the SE results. The n value for HfO2<\/sub> measured at 550-nm (2.26 eV) wavelength is 1.84, which is similar with the previous report [20].", 37 | "value_source_table":"", 38 | }, ... 39 | ] 40 | 41 | There are some certain rules you need to follow: 42 | 1. Be thorough! Don't miss even a single numeric value. 43 | 2. Avoid null low_value and high_value. Simply skip the entry if you cannot extract the numeric value. 44 | 3. If a value is only extractable from a figure, not text or table, simply skip that entry. 45 | 4. Make sure whatever low_value and high_value actually exists in the sourcepassage or refered table. 46 | 5. If one passage contains multiple materials or properties, record all of them as separate entries. 47 | 6. If different values are mentioned for the same material property, record all of them as separate entries. 48 | 7. Material, property, and value may or may not be mentioned in the same passage. 49 | 8. Only extract real physical or chemical properties that are important and relevant. Do not include trivia in your output, such as geometries. 50 | 51 | This below is the excerpt from a paper in latex format published in a materials science journal. 52 | Excerpt: 53 | text: 54 | {{text}} 55 | ------------------------------ 56 | With the above excerpt, the goal is to extract numeric values or numeric value ranges of material properties from the above paper. 57 | 58 | Please only output your answer in the exact format as shown above without any prefix. 59 | Output: 60 | -------------------------------------------------------------------------------- /prompts/extract_structure_data_1_shot.txt: -------------------------------------------------------------------------------- 1 | A materials scientist would like to reproduce the DFT calculations from a paper. 
2 | They want to identify the input structures, and gather as much information about the structures as possible. \ 3 | 4 | Make sure to identify all input structures, and output a list of the distinct input structures information with fields \ 5 | "id", "common_name", "scientific_name", "type", "composition", "description", "vacuum_x", "vacuum_y", "vacuum_z", "supercell", "cas_number", "lattice_a", "lattice_b", "lattice_c", "space group", "orientation", "mp_id", "isomer_name". \ 6 | The "id" field is just a string of format "structure_metadata_{number}", where {number} starts from 1 and indicates the order the structure appears in the excerpt. \ 7 | The "description" field should capture all relevant information that is not captured by the other fields. Make sure to write your output as a list of dictionaries, NOT A BULLETED LIST. \ 8 | The "common_name" field should have the common name of the material and "scientific_name" should have the formal scientific name of the material. \ 9 | The "type" field should point out whether the material is a molecule, a protein, a bulk crystal structure, a thin film or something else. \ 10 | If the structure has vacuum around it, please include the thickness of vacuum layer in three directions with units in these fields: "vacuum_x", "vacuum_y", "vacuum_z". \ 11 | The fields "lattice_a", "lattice_b", "lattice_c" correspond to the lattice parameters of the unit cell. Include units in these fields as well. Leave them blank if lattice parameters are missing. \ 12 | The "supercell" field should tell how many times the unit cell is repeated in the [x, y, z] directions, like "2x2x2". \ 13 | If the text indicates the material structure is from material project and provides a "mp_id", please include that in the field "mp_id". 14 | If the structure has multiple isomers and the text mentions which is used, please indicate that in the field "isomer_name". \ 15 | If any information is relevant but missing, input "UNKNOWN" in the field. If any information is irrelevant and missing, input None in the field. \ 16 | The "description" field should capture all relevant information that is not captured by the other fields. Make sure to write your output as a list of dictionaries, NOT A BULLETED LIST. 17 | 18 | This is an example excerpt and the output: 19 | Example excerpt: 20 | We performed the projector augmented wave (PAW) based spin-polarized simulations using the Vienna 21 | ab initio Simulation Package (VASP) [40], [41]. The STO has 5 atoms (1 Sr, 1 Ti and 3 O) in its 22 | unit cell. We considered a 2x2x2 supercell consisting of 40 atoms for all simulations presented in 23 | this article. The electronic configuration of STO is divided into valence and core to facilitate 24 | the PAW. We considered 10 electrons of Sr (4s24p65p2), 10 electrons of Ti (3p63d24s2) and 6 25 | electrons of O (2s22p4) as valence electrons (in total 26 valence electrons) and the rest were 26 | modelled as frozen core. In cases of doped STO, we considered Pt-5d96s1, S-3s23p4 and Se-4s24p6 as 27 | valence electrons. We replaced the 3d-TM Ti with 5d-TM Pt in case of the Pt-doped STO. 
28 | 29 | Example output: 30 | [ 31 | {"id": "structure_metadata_1", "common_name": "STO", "scientific_name": "Strontium titanate", "type": "crystal", "composition": "SrTiO3", "description": "pure SrTiO3", "vacuum_x": None, "vacuum_y": None, "vacuum_z": None, "supercell": "[2,2,2]", "cas_number": "12060-59-2", "lattice_a": None, "lattice_b": None, "lattice_c": None, "space group": None, "orientation": None, "mp_id": None, "isomer_name": None}, 32 | {"id": "structure_metadata_2", "common_name": "Pt-doped STO", "scientific_name": "Platinum doped Strontium titanate", "type": "crystal", "composition": "UNKNOWN", "description": "replaced the 3d-TM Ti with 5d-TM Pt in case of the Pt-doped STO", "vacuum_x": None, "vacuum_y": None, "vacuum_z": None, "supercell": "[2,2,2]", "cas_number": "12060-59-2", "lattice_a": None, "lattice_b": None, "lattice_c": None, "space group": None, "orientation": None, "mp_id": None, "isomer_name": None} 33 | ] 34 | 35 | 36 | Here is the paper: 37 | 38 | {{text}} 39 | 40 | --- 41 | Identify the input structures, and gather as much information about the structures as possible. \ 42 | Use the format from the example output. 43 | -------------------------------------------------------------------------------- /prompts/extract_dft_metadata_1_shot.txt: -------------------------------------------------------------------------------- 1 | A materials scientist would like to reproduce the DFT calculations from a paper. 2 | They want detailed information about each of the DFT calculations in the paper. 3 | Output a list of the distinct DFT parameters with fields including, \ 4 | but not limited to, "software", "functional", "k-points-grid", "pseudopotentials", \ 5 | "basis_set", "energy_cutoff", "energy_convergence", "force_convergence", "relaxed_nuclei", \ 6 | "relaxed_unit_cell", "spin", "hubbard_U". \ 7 | If a structure is relaxed, please set "relaxed_nuclei" or "relaxed_unit_cell" \ 8 | to 1.0 (corresponding to true). \ 9 | If spin is involved in the calculation, set "spin" to 1.0 \ 10 | Include the units in these fields if applicable. \ 11 | If any information is relevant but missing, input "NaN" in the field. \ 12 | If any information is irrelevant and missing, input "NaN" in the field. \ 13 | Make sure to write your output in JSON format. \ 14 | Include any related information to the that has not been covered into the "other_information" field. 15 | 16 | Use the following format: 17 | [ 18 | { 19 | "function_name":"short_function_description", 20 | "software": "Specify which software was used, e.g. vasp, gaussian, castep, qe, dmol, orca, wein2k", 21 | "functional": "functional name", 22 | "k-points": "k-point grid as [x,y,z]", 23 | "energy_cutoff": "energy cutoff in eV if mentioned, else NaN", 24 | "energy_convergence": "energy convergence if mentioned, else NaN", 25 | "force_convergence": "force convergence if mentioned, else NaN", 26 | "relaxed_nuclei": "1.0 if nuclei is mentioned to be relaxed, 0.0 if mentioned to be fixed, else NaN", 27 | "relaxed_unit_cell": "1.0 if it is mentioned to be relaxed, 0.0 if if mentioned not relaxed or else NaN ", 28 | "spin": "1.0 if spin considered, 0.0 if not considered, else NaN", 29 | "hubbard_U": "NaN if not mentioned", 30 | "other_information": "Any other relevant information for the calculation." 31 | }, 32 | ... 
33 | ] 34 | 35 | Here is an example excerpt, and the output: 36 | Example excerpt: 37 | We performed the projector augmented wave (PAW) based spin-polarized simulations using the Vienna 38 | ab initio Simulation Package (VASP) [40], [41]. The STO has 5 atoms (1 Sr, 1 Ti and 3 O) in its 39 | unit cell. We considered a 2x2x2 supercell consisting of 40 atoms for all simulations presented in 40 | this article. The electronic configuration of STO is divided into valence and core to facilitate 41 | the PAW. We considered 10 electrons of Sr (4s24p65p2), 10 electrons of Ti (3p63d24s2) and 6 42 | electrons of O (2s22p4) as valence electrons (in total 26 valence electrons) and the rest were 43 | modelled as frozen core. We replaced the 3d-TM Ti with 5d-TM Pt in case of the Pt-doped STO. 44 | We used 6 × 6 × 6 Monkhorst Pack grid k-points mesh to sample the Brillouin zone (BZ). The kinetic 45 | energy cutoff for plane waves was 540 eV. 46 | 47 | To understand the electronic properties of undoped STO, we simulated the 48 | spin-polarized total density of states (TDOS) within a 14 eV energy window centred at the Fermi 49 | level (E_F) for HSE06 functional as displayed in Fig. 1(a, b). The atomic density of states (DOS) 50 | confirms the dominance of O and Ti atoms in contributing to the electronic states near VBM and CBm 51 | respectively as displayed in Fig. 1(a). For the electronic band structure (BS), we 52 | considered the high symmetry points Γ, R, X and M in the Brillouin zone (BZ) of the cubic STO 53 | within the energy range from −8 to 8 eV centred at E_F, see Fig. 2. The HSE06 produced band gaps of 54 | both direct (3.66 eV at Γ ) and indirect (3.25 eV at R) nature which are in excellent agreement 55 | with the experimental observations of direct (indirect) band gap of 3.75 eV (3.25 eV) found in Ref. [4]. 56 | Pt-doped STO was found to be metallic. 57 | 58 | Example output: 59 | { 60 | "function_name":"create_dft_parameters_hse06", 61 | "software": "vasp", 62 | "functional": "HSE06", 63 | "k-points": "[6,6,6]", 64 | "energy_cutoff": 540.0, 65 | "energy_convergence": "NaN", 66 | "force_convergence": "None", 67 | "relaxed_nuclei": 0.0, 68 | "relaxed_unit_cell": "NaN", 69 | "spin": 1.0, 70 | "hubbard_U": "NaN", 71 | "other_information": "None" 72 | } 73 | --- 74 | 75 | Here is the paper: 76 | 77 | {{text}} 78 | 79 | --- 80 | 81 | Using the specified format, extract and \ 82 | list out the details of all the DFT calculations in the paper. 83 | 84 | Answer: 85 | -------------------------------------------------------------------------------- /prompts/dft_metadata_eval_output_1_shot.txt: -------------------------------------------------------------------------------- 1 | You are a materials scientist. You will be given one json format property or metadata from a paper by human annotators as the ground truth, 2 | and a list of json format properties or metadata extracted from the same paper by a LLM model. 3 | You goal is to find which json in the json list is most similar to the ground truth json. 4 | 5 | You will do so by comparing whether most non-NaN attributes are similar (a match) for the two jsons. 6 | You do not need to consider about attributes that are NaN when comparing. 7 | 8 | If most non-NaN attributes are similar, output the index of the json that pairs with the ground truth as "json_extracted_index", 9 | and details of comparison as "compare". 1 means similar or same, and 0 means totally different or NaN. 
The format will look like: 10 | { 11 | "json_extracted_index": , 12 | "compare": { 13 | "function_name": 1, 14 | "software": 1, 15 | "functional": 1, 16 | "k-points": 1, 17 | "energy_cutoff": 1, 18 | "energy_convergence": 0, 19 | "force_convergence": 0, 20 | "relaxed_nuclei": 1, 21 | "relaxed_unit_cell": 1, 22 | "spin": 0, 23 | "hubbard_U": 0, 24 | "other_information": 0 25 | } 26 | } 27 | 28 | Otherwise, if no json in the list is similar to the ground truth, simply output an empty json {}. 29 | 30 | Here are some detailed rules you have to follow: 31 | 1. Output json "compare" field should have mostly the same keys as json_ground_truth. 32 | 2. Output values for keys under "compare" are 0 or 1. Set 1 if json values are similar and non empty or NaN for the same key between ground truth and given jsons. Set 0 otherwise. 33 | 3. Be lenient when comparing texts. Tend to consider them the similar (value 1) as long as they have overlapping parts, regardless of formats. 34 | 4. Synonyms can be treated as being the similar. 35 | 5. For "json_extracted_index", use 0-based numbering. 36 | 37 | For example, if we have the following ground truth json and the list of llm extracted jsons. 38 | -------------------------------------------- 39 | json_ground_truth: 40 | { 41 | "function_name": "structural_optimization", 42 | "software": "VASP", 43 | "functional": "Perdew-Burke-Ernzerhof", 44 | "k-points": "NaN", 45 | "energy_cutoff": "NaN", 46 | "energy_convergence": "NaN", 47 | "force_convergence": "NaN", 48 | "relaxed_nuclei": 1.0, 49 | "relaxed_unit_cell": 1.0, 50 | "spin": "NaN", 51 | "hubbard_U": "NaN", 52 | "other_information": "None" 53 | } 54 | -------------------------------------------- 55 | json_extracted_list: 56 | [ 57 | { 58 | "function_name": "structural_optimization", 59 | "software": "vasp", 60 | "functional": "PBE", 61 | "k-points": "NaN", 62 | "energy_cutoff": "NaN", 63 | "energy_convergence": "NaN", 64 | "force_convergence": "NaN", 65 | "relaxed_nuclei": 1.0, 66 | "relaxed_unit_cell": 1.0, 67 | "spin": "NaN", 68 | "hubbard_U": "NaN", 69 | "other_information": "None" 70 | }, 71 | { 72 | "function_name": "phonon_calculation", 73 | "software": "vasp", 74 | "functional": "PBE", 75 | "k-points": "NaN", 76 | "energy_cutoff": "NaN", 77 | "energy_convergence": "NaN", 78 | "force_convergence": "NaN", 79 | "relaxed_nuclei": 0.0, 80 | "relaxed_unit_cell": 0.0, 81 | "spin": "NaN", 82 | "hubbard_U": "NaN", 83 | "other_information": "None" 84 | }, 85 | ... 86 | ] 87 | -------------------------------------------- 88 | 89 | By comparing the two jsons, the output will be output json: 90 | 91 | Output: 92 | { 93 | "json_extracted_index": 0, 94 | "compare": { 95 | "function_name": 1, 96 | "software": 1, 97 | "functional": 1, 98 | "k-points": 0, 99 | "energy_cutoff": 0, 100 | "energy_convergence": 0, 101 | "force_convergence": 0, 102 | "relaxed_nuclei": 1, 103 | "relaxed_unit_cell": 1, 104 | "spin": 0, 105 | "hubbard_U": 0, 106 | "other_information": 0 107 | } 108 | } 109 | 110 | 111 | With the given format, please output accordingly for the following jsons: 112 | -------------------------------------------- 113 | json_ground_truth: 114 | {{json_ground_truth}} 115 | -------------------------------------------- 116 | json_generated: 117 | {{json_extracted_list}} 118 | -------------------------------------------- 119 | 120 | Please only output your answer in the exact format as shown above without any prefix or suffix. 
121 | Output: 122 | -------------------------------------------------------------------------------- /docs/code-of-conduct.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as 6 | contributors and maintainers pledge to making participation in our project and 7 | our community a harassment-free experience for everyone, regardless of age, body 8 | size, disability, ethnicity, gender identity and expression, level of 9 | experience, education, socio-economic status, nationality, personal appearance, 10 | race, religion, or sexual identity and orientation. 11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or reject 41 | comments, commits, code, wiki edits, issues, and other contributions that are 42 | not aligned to this Code of Conduct, or to ban temporarily or permanently any 43 | contributor for other behaviors that they deem inappropriate, threatening, 44 | offensive, or harmful. 45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies both within project spaces and in public spaces 49 | when an individual is representing the project or its community. Examples of 50 | representing a project or community include using an official project e-mail 51 | address, posting via an official social media account, or acting as an appointed 52 | representative at an online or offline event. Representation of a project may be 53 | further defined and clarified by project maintainers. 54 | 55 | This Code of Conduct also applies outside the project spaces when the Project 56 | Steward has a reasonable belief that an individual's behavior may have a 57 | negative impact on the project or its community. 58 | 59 | ## Conflict Resolution 60 | 61 | We do not believe that all conflict is bad; healthy debate and disagreement 62 | often yield positive results. However, it is never okay to be disrespectful or 63 | to engage in behavior that violates the project’s code of conduct. 64 | 65 | If you see someone violating the code of conduct, you are encouraged to address 66 | the behavior directly with those involved. Many issues can be resolved quickly 67 | and easily, and this gives people more control over the outcome of their 68 | dispute. 
If you are unable to resolve the matter for any reason, or if the 69 | behavior is threatening or harassing, report it. We are dedicated to providing 70 | an environment where participants feel welcome and safe. 71 | 72 | Reports should be directed to *[PROJECT STEWARD NAME(s) AND EMAIL(s)]*, the 73 | Project Steward(s) for *[PROJECT NAME]*. It is the Project Steward’s duty to 74 | receive and address reported violations of the code of conduct. They will then 75 | work with a committee consisting of representatives from the Open Source 76 | Programs Office and the Google Open Source Strategy team. If for any reason you 77 | are uncomfortable reaching out to the Project Steward, please email 78 | opensource@google.com. 79 | 80 | We will investigate every complaint, but you may not receive a direct response. 81 | We will use our discretion in determining when and how to follow up on reported 82 | incidents, which may range from not taking action to permanent expulsion from 83 | the project and project-sponsored spaces. We will notify the accused of the 84 | report and provide them an opportunity to discuss it before any action is taken. 85 | The identity of the reporter will be omitted from the details of the report 86 | supplied to the accused. In potentially harmful situations, such as ongoing 87 | harassment or threats to anyone's safety, we may take action without notice. 88 | 89 | ## Attribution 90 | 91 | This Code of Conduct is adapted from the Contributor Covenant, version 1.4, 92 | available at 93 | https://www.contributor-covenant.org/version/1/4/code-of-conduct/ 94 | -------------------------------------------------------------------------------- /prompts/dft_structure_eval_output_1_shot.txt: -------------------------------------------------------------------------------- 1 | You are a materials scientist. You will be given one json format property or metadata from a paper by human annotators as the ground truth, 2 | and a list of json format properties or metadata extracted from the same paper by a LLM model. 3 | You goal is to find which json in the json list is most similar to the ground truth json. 4 | 5 | You will do so by comparing whether most non-NaN attributes are similar (a match) for the two jsons. 6 | You do not need to consider about attributes that are NaN when comparing. 7 | 8 | If most non-NaN attributes are similar, output the index of the json that pairs with the ground truth as "json_extracted_index", 9 | and details of comparison as "compare". 1 means similar or same, and 0 means totally different or NaN. The format will look like: 10 | { 11 | "json_extracted_index": , 12 | "compare": { 13 | 'additional_information': 0, 14 | 'cas_number': 0, 15 | 'cell_size': 0, 16 | 'common_name': 1, 17 | 'composition': 1, 18 | 'crystal_or_isolated': 1, 19 | 'crystal_structure': 0, 20 | 'isomer': 0, 21 | 'lattice_parameters': 0, 22 | 'miller_indices': 0, 23 | 'mp_id': 0, 24 | 'scientific_name': 0, 25 | 'space_group': 1, 26 | 'supercell': 0, 27 | 'type': 'bulk', 28 | 'vacuum': 0 29 | } 30 | } 31 | 32 | Otherwise, if no json in the list is similar to the ground truth, simply output an empty json {}. 33 | 34 | Here are some detailed rules you have to follow: 35 | 1. Output json "compare" field should have mostly the same keys as json_ground_truth. 36 | 2. Output values for keys under "compare" are 0 or 1. Set 1 if json values are similar and non empty or NaN for the same key between ground truth and given jsons. Set 0 otherwise. 37 | 3. Be lenient when comparing texts. 
Tend to consider them the similar (value 1) as long as they have overlapping parts, regardless of formats. 38 | 4. Synonyms can be treated as being the similar. 39 | 5. For "json_extracted_index", use 0-based numbering. 40 | 41 | For example, if we have the following ground truth json and the list of llm extracted jsons. 42 | -------------------------------------------- 43 | json_ground_truth: 44 | { 45 | 'additional_information': 'NaN', 46 | 'cas_number': 'NaN', 47 | 'cell_size': 'NaN', 48 | 'common_name': 'lanthanam lithium oxyhydride', 49 | 'composition': 'La_2LiHO_3', 50 | 'crystal_or_isolated': 'Crystal', 51 | 'crystal_structure': 'NaN', 52 | 'isomer': 'NaN', 53 | 'lattice_parameters': 'NaN', 54 | 'miller_indices': 'NaN', 55 | 'mp_id': 'NaN', 56 | 'scientific_name': 'NaN', 57 | 'space_group': 'Immm', 58 | 'supercell': 'NaN', 59 | 'type': 'bulk', 60 | 'vacuum': 'NaN' 61 | } 62 | -------------------------------------------- 63 | json_extracted_list: 64 | [ 65 | { 66 | 'additional_information': 'NaN', 67 | 'cas_number': 'NaN', 68 | 'cell_size': 'NaN', 69 | 'common_name': 'Lanthanam Lithium Oxyhydride', 70 | 'composition': 'La2LiHO3', 71 | 'crystal_or_isolated': 'crystal', 72 | 'crystal_structure': 'NaN', 73 | 'isomer': 'NaN', 74 | 'lattice_parameters': 'NaN', 75 | 'miller_indices': 'NaN', 76 | 'mp_id': 'NaN', 77 | 'scientific_name': 'NaN', 78 | 'space_group': 'Immm', 79 | 'supercell': 'NaN', 80 | 'type': 'bulk', 81 | 'vacuum': 'NaN' 82 | }, 83 | { 84 | 'additional_information': 'NaN', 85 | 'cas_number': 'NaN', 86 | 'cell_size': 'NaN', 87 | 'common_name': 'Lanthanam Lithium Oxyhydride', 88 | 'composition': 'La2LiHO3', 89 | 'crystal_or_isolated': 'crystal', 90 | 'crystal_structure': 'NaN', 91 | 'isomer': 'NaN', 92 | 'lattice_parameters': 'NaN', 93 | 'miller_indices': 'NaN', 94 | 'mp_id': 'NaN', 95 | 'scientific_name': 'NaN', 96 | 'space_group': 'I4/mmm', 97 | 'supercell': 'NaN', 98 | 'type': 'bulk', 99 | 'vacuum': 'NaN' 100 | }, 101 | ... 102 | ] 103 | -------------------------------------------- 104 | 105 | By comparing the two jsons, the output will be output json: 106 | 107 | Output: 108 | { 109 | "json_extracted_index": 0, 110 | "compare": { 111 | 'additional_information': 0, 112 | 'cas_number': 0, 113 | 'cell_size': 0, 114 | 'common_name': 1, 115 | 'composition': 1, 116 | 'crystal_or_isolated': 1, 117 | 'crystal_structure': 0, 118 | 'isomer': 0, 119 | 'lattice_parameters': 0, 120 | 'miller_indices': 0, 121 | 'mp_id': 0, 122 | 'scientific_name': 0, 123 | 'space_group': 1, 124 | 'supercell': 0, 125 | 'type': 'bulk', 126 | 'vacuum': 0 127 | } 128 | } 129 | 130 | 131 | With the given format, please output accordingly for the following jsons: 132 | -------------------------------------------- 133 | json_ground_truth: 134 | {{json_ground_truth}} 135 | -------------------------------------------- 136 | json_generated: 137 | {{json_extracted_list}} 138 | -------------------------------------------- 139 | 140 | Please only output your answer in the exact format as shown above without any prefix or suffix. 
141 | Output: 142 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning 2 | 3 | Evaluation Code accompanying the paper 4 | 5 | [**CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning**](https://arxiv.org/abs/2503.13517) 6 | [ICLR 2025](https://iclr.cc/Conferences/2025) 7 | [Paper](https://arxiv.org/abs/2503.13517) | [Poster](extras/ICLR_2025_poster_CURIE.pdf) | [Slides](extras/CURIE_ICLR2025_deck_shared.pdf) 8 | 9 | > **TL;DR:** we introduce CURIE (Scientific Long **C**ontext **U**nderstanding **R**easoning and **I**nformation **E**xtraction), a benchmark with 10 tasks from 6 science domains specifically designed to test the ability of LLMs to assist scientists in realistic workflows. 10 | 11 | The CURIE benchmark encompasses 10 tasks, with a total of 580 input and solution pairs based on 429 research documents across six 12 | diverse scientific disciplines: materials science, theoretical condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins – covering both experimental and theoretical aspects of scientific research. The average length of the input queries in CURIE is about 15k words, and the ground truth responses contain on average 954 words. 13 | 14 | (a) CURIE benchmark encompasses 10 tasks, with a total of 580 input and solution 15 | pairs based on 429 research documents across six diverse scientific disciplines: 16 | materials science, theoretical condensed matter physics, quantum computing, 17 | geospatial analysis, biodiversity, and proteins – covering both experimental and 18 | theoretical aspects of scientific research. (b) The average length of the input 19 | queries in CURIE is about 15k words, and (c) the ground truth responses contain 20 | on average 954 words. 21 | 22 | ## 🗄️ Data 23 | 24 | Our data is organized into eight domain-specific subfolders: "biogr", "dft", "pdb", "geo", "mpve", "qecc_65", "hfd", and "hfe". Each subfolder contains two further subfolders: "ground_truth" and "inputs". Within these, each data instance is stored in a JSON file named record_id.json, where record_id is a unique identifier. The "biogr" domain also includes image inputs as record_id.png files alongside the corresponding JSON. 25 | 26 | ```bash 27 | data 28 | ├── domain 29 | ├── inputs 30 | │   └── record_id.json 31 | └── ground_truth 32 | └── record_id.json 33 | └── difficulty_levels.json 34 | 35 | ``` 36 | 37 | Ground truth data varies in structure and content across domains, but all files consistently include a record_id field matching the filename. Input files have a uniform structure across all domains, containing both a record_id field and a text field representing the input text to LLMs. 38 | 39 | For the "biogr" (geo-referencing) task, for 114 of the 138 examples, we release additional data including the PDF papers that each image was taken from along with other metadata in this Github repo: [https://github.com/google-research/ecology-georeferencing](https://github.com/google-research/ecology-georeferencing) 40 | 41 | ## 🧪 Running Inference. 42 | An example Colab notebook that runs inference by iterating over all examples and prompts for all tasks is provided at code/curie_inference.ipynb. 43 | To execute it: 44 | 1. Add your API key for the model. 45 | 2. Connect to the default runtime ("Python 3 Google Compute Engine backend"). 46 | 3. In the "params" cell, configure the following: 47 |    - `root_path`: Path to the data folder. 48 | 49 | 50 | ## 🧪 Running eval. 51 | Our evaluation Colab notebook is provided at code/curie_run_eval.ipynb. To execute it: 52 | 1. Connect to the default runtime ("Python 3 Google Compute Engine backend"). 53 | 2. In the "params" cell, configure the following: 54 |    - `root_path`: Path to the data folder. 55 |    - `domain`: The target domain (e.g., "biogr", "dft"). 56 |    - `llm`: The Large Language Model to evaluate. 57 |    - `prompt`: The prompt used for the LLM. 58 |    - `record_id`: The ID of the record to evaluate. 59 | 3. Run the Colab. Evaluation metrics will be printed at the end of the notebook. 60 | 61 | Note: Evaluating the "dft" and "mpve" tasks using the LLMSim score requires querying LLMs and therefore requires setting up a Google API key. 62 | 63 | 64 | ## 📊 Generating tables and plots. 
65 | 66 | To generate the tables and plots in the paper use the notebook code/curie_generate_tables_figures.ipynb 67 | 68 | ## 📝 TODOs 69 | 70 | - [ ] Release responses by baselines to fully reproduce the reported numbers. 71 | - [x] Add folder with data. 72 | - [x] Update evals to include all metrics. 73 | - [x] Example Colab to run inference. 74 | - [x] Colab to run evaluation. 75 | - [x] Colab to generate all plots and tables. 76 | 77 | ## ✉️ Contact 78 | 79 | This repository is created and maintained by [Subhashini](https://vsubhashini.github.io/). Questions and discussions are welcome under issues. 80 | 81 | ## 🙏 Acknowledgements 82 | 83 | We are grateful to the many domain experts who have contributed to the creation 84 | of the benchmark and evaluations. 85 | 86 | ## 📄 License 87 | 88 | Code in this Github repository is licensed under a [APACHE 2.0 License](./LICENSE). 89 | 90 | ## 🎓 Citing CURIE 91 | 92 | ``` 93 | @inproceedings{cui2025curie, 94 | title={CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning}, 95 | author={Cui, Hao and Shamsi, Zahra and Cheon, Gowoon and Ma, Xuejian and Li, Shutong and Tikhanovskaya, Maria and Norgaard, Peter Christian and Mudur, Nayantara and Plomecka, Martyna Beata and Raccuglia, Paul and others}, 96 | booktitle={The Thirteenth International Conference on Learning Representations} 97 | year={2025} 98 | } 99 | ``` 100 | 101 | *This is not an officially supported Google product.* 102 | -------------------------------------------------------------------------------- /prompts/describe_code_in_paper.txt: -------------------------------------------------------------------------------- 1 | Fill in a YAML file for the code described in the attached paper according to the prescription defined in the YAML template that starts on the next paragraph. Fields are to be filled only if they are directly relevant to the code introduced in the paper. Be sure to extract any quantitative data like thresholds and code rates. Above all, be concise! If you cannot explain something technical in detail, do not try to explain it. If something is not detailed in the paper, do not mention it. 2 | 3 | ####################################################### 4 | ## This is a code entry in the error correction zoo. ## 5 | ## https://github.com/errorcorrectionzoo ## 6 | ####################################################### 7 | 8 | # Use UTF-8 unicode encoding 9 | # AMS-TeX commands are rendered inside \( ... \) using MathJaX. 10 | # Allowed external bibliographic references are 11 | # \cite{arXiv:#.#} or \cite{arXiv:quant-ph/#} (PREFERRED), 12 | # \cite{doi:#}, or, as a last resort 13 | # \cite{manual:{(enter citation line incl. author and year here)}} 14 | # External websites such as code tables, coding theory packages, github pages linked as 15 | # \\url{https://example.com/example} 16 | # \href{https://example.com/example}{link text} 17 | # Internal references to codes are 18 | # \hyperref[code:code_id]{link text} 19 | # Delete instructional comments when submitting 20 | 21 | # code id, physical, logical are all lower case 22 | # physical or logical are one of the following: bits, q-ary_digits, matrices, rings, reals, spheres, qubits, qudits, galois, oscillators, spins, or categories 23 | code_id: no_spaces_lower_case 24 | physical: qubits 25 | logical: qubits 26 | 27 | # Only list if the code being described has specific parameters. 
These are typically in the form of (n,K,d) and [n,k,d] for classical codes, or ((n,K,d)) and [[n,k,d]] for quantum codes. 28 | Code_parameter: '((2^r-1,r,d))_{6}' 29 | 30 | # Apostrophes are denoted by two apostrophe characters, i.e., '' 31 | # Code title (SINGULAR) + first reference(s) (optional). 32 | name: 'Important-last-name favorite code' 33 | introduced: '\cite{doi:10.1070/RM1997v052n06ABEH002155}' 34 | 35 | # Anything applicable to a larger parent set of codes (see below) should go in 36 | # that entry instead of here. 37 | description: | 38 | First paragraph is a short standalone description, containing no references to figures. 39 | 40 | Subsequent paragraphs go into (possibly quantitative) details. 41 | 42 | \subsection{Subsection headings work} 43 | \paragraph{And so do paragraphs} 44 | # Only add subsections or paragraphs if the paper has long discussions about broad code classes. 45 | 46 | 47 | # Long fields such as this one can be written in other YML formats, such as the one using the pipe symbol 48 | # protection: | 49 | # text... 50 | # more text... 51 | protection: 'Protects against ... Pauli noise. Approximate code with parameters ... for noise model ... .' 52 | 53 | # This field starts a list of specific labeled subfields; do not leave it empty. If empty, comment out. Also, indentations are important! 54 | features: 55 | 56 | # Do not include this if no specific encoders are mentioned. 57 | encoders: 58 | - 'Specific description of a process that makes the states, usually for quantum codes.' 59 | - 'Unitary circuit of depth ... \cite{arxiv:old-paper}.' 60 | - 'Measurement-based preparation ... with ancilla overhead of ... .' 61 | - 'Leave discussion of fault tolerance to fault-tolerance field.' 62 | 63 | # Not all fields are indexed by a dash 64 | transversal_gates: 'Transversal ... gates \cite{doi:ok-paper}. Comment out if doesn''t apply.' 65 | 66 | # Do not include this if no specific gates are mentioned. 67 | general_gates: 68 | - 'Universal gate set achieved by either additional ... gate.' 69 | - 'Magic-state distillation protocols' 70 | - 'kth \term{Clifford hierarchy} gates obtained by ... circuits' 71 | 72 | # Do not include this if no specific decoders are mentioned. 73 | decoders: 74 | - 'Details about how syndrome measurements are done; discuss overhead, if applicable.' 75 | - 'MWPM decoding algorithm \cite{doi:good-paper} with ... overhead.' 76 | - 'Just-in-time decoder with ... \cite{arxiv:awesome-paper}.' 77 | 78 | fault_tolerance: 79 | - 'Transversal gates are fault-tolerant w.r.t. ... noise \cite{doi:ok-paper}' 80 | - 'Other fault-tolerant gadgets (measurements, encoders, error correcting steps)' 81 | - 'Noise-model-preserving gadgets, noise-biased gates, fault-tolerant flag error correction' 82 | - 'Pieceable fault tolerance.' 83 | 84 | code_capacity_threshold: 85 | - '\(1.5%\) error-correction threshold against some noise with *noiseless* decoder of some complexity \cite{arxiv:paper}.' 86 | 87 | threshold: 88 | - '\(0.3\%\) error-correction threshold ... with *noisy* ... decoder of some complexity \cite{doi:good-paper}.' 89 | - '\(10^{-5}\) computational threshold using concatenated scheme under ... noise with overhead of ... ' 90 | 91 | # Include only if specific experimental or real-world realizations are reported. 92 | realizations: 93 | # List and explain the different "domains" of realizations in list items. 94 | - 'Code used in DVDs \cite{doi:####...}, 5G, etc.' 95 | - 'Realized in trapped-ion quantum devices \cite{arXiv:####.#####}, etc.' 
96 | 97 | # Only include notes if the specific technical items listed below are included in the paper. 98 | notes: 99 | - 'Bounds on \(n\), \(k\), or \(d\) for this class, unless mentioned in description.' 100 | - 'Links to code tables, github, GAP algebra packages, more papers \cite{arXiv:####.#####}.' 101 | 102 | # Include as many direct relations as were mentioned in the paper. The relations below are just examples. 103 | relations: 104 | parents: 105 | - code_id: code_id1 106 | detail: 'The smallest code family that includes this code that is defined over the same physical space structure or alphabet.' 107 | cousins: 108 | - code_id: code_id2 109 | detail: 'Codes that are directly relevant and described by a property shared by this code.' 110 | - code_id: code_id3 111 | detail: 'Code family of similar encoding but with different physical space structures (qudit vs. qubit surface code).' 112 | 113 | # Include footer below and change the date to today’s date in the prescribed format 114 | # Begin Entry Meta Information 115 | _meta: 116 | # Change log - most recent first 117 | changelog: 118 | - user_id: VictorVAlbert 119 | date: 'YYYY-MM-DD' 120 | 121 | Here is the paper 122 | 123 | {{text}} 124 | -------------------------------------------------------------------------------- /prompts/mat_eval_output_1_shot.txt: -------------------------------------------------------------------------------- 1 | You are a materials scientist. You will be given one json format material property extracted from a paper by human annotators as the ground truth, 2 | and a list of json format material properties extracted from the same paper by an LLM model. 3 | Your goal is to find which json in the json list is the same as the ground truth json. 4 | 5 | You will do so by comparing whether the following core attributes are the same (a match) for the two jsons: 6 | "material", "property_name", "value_units", "high_value", "low_value". 7 | Also, pay attention to "material_descriptor" and "property_descriptor", as they might suggest the material is in a totally different form or state. 8 | Consider them mismatches if the descriptors suggest so. 9 | 10 | If the above attributes are the same, output the index of the json that pairs with the ground truth as "json_extracted_index", 11 | and details of comparison as "compare". The format will look like: 12 | { 13 | "json_extracted_index": <index of the matching json in json_extracted_list>, 14 | "compare": { 15 | "high_value": 1, 16 | "low_value": 1, 17 | "material": 1, 18 | "material_descriptor": 0, 19 | "material_source_passage": 0, 20 | "material_source_table": 0, 21 | "property_descriptor": 0, 22 | "property_name": 1, 23 | "property_source_passage": 1, 24 | "property_source_table": 0, 25 | "value_source_passage": 1, 26 | "value_source_table": 0, 27 | "value_units": 1 28 | } 29 | } 30 | 31 | Otherwise, simply output an empty json {}. 32 | 33 | Here are some detailed rules you have to follow: 34 | 1. The output json "compare" field should have the same keys as json_ground_truth. 35 | 2. Output values for keys under "compare" are 0 or 1. Set 1 if json values are the same and non-empty for the same key between ground truth and given jsons. Set 0 otherwise. 36 | 3. Be lenient when comparing passages. Tend to consider them the same (value 1) as long as they have overlapping parts, regardless of formats. 37 | 4. For material names, synonyms can be considered the same, such as "InN" and "Indium Nitride". 38 | 5. For "json_extracted_index", use 0-based numbering.
39 | 40 | For example, if we have the following ground truth json and the list of llm extracted jsons. 41 | -------------------------------------------- 42 | json_ground_truth: 43 | { 44 | "high_value": "0.69", 45 | "low_value": "0.69", 46 | "material": "Indium Nitride", 47 | "material_descriptor": " ", 48 | "material_source_passage": "Indium Nitride (InN), with a band gap of ~0.69 eV [1, 2] has become the focus of increased attention among the III-N compounds due to its potential for near-infrared optoelectronic devices or high efficiency solar cells [3]. Moreover, the combination of the intrinsic properties of InN with quantum phenomena [4], resulting from the growth of self-assembled quantum dots (QDs), promises further applications.", 49 | "material_source_table": " ", 50 | "property_descriptor": " ", 51 | "property_name": "band gap", 52 | "property_source_passage": "Indium Nitride (InN), with a band gap of ~0.69 eV [1, 2] has become the focus of increased attention among the III-N compounds due to its potential for near-infrared optoelectronic devices or high efficiency solar cells [3].", 53 | "property_source_table": " ", 54 | "value_source_passage": "Indium Nitride (InN), with a band gap of ~0.69 eV [1, 2] has become the focus of increased attention among the III-N compounds due to its potential for near-infrared optoelectronic devices or high efficiency solar cells [3].", 55 | "value_source_table": " ", 56 | "value_units": "eV" 57 | } 58 | -------------------------------------------- 59 | json_extracted_list: 60 | [{ 61 | "json_extracted_index": 0, 62 | "high_value": "0.69", 63 | "low_value": "0.69", 64 | "material": "InN", 65 | "material_descriptor": "quantum dots", 66 | "material_source_passage": "We present a study by transmission electron microscopy (TEM) of the strain state of individual InN quantum dots (QDs) grown on GaN substrates.", 67 | "material_source_table": "", 68 | "property_descriptor": "", 69 | "property_name": "band gap", 70 | "property_source_passage": "Indium Nitride (InN), with a band gap of *0.69 eV [1, 2] has become the focus of increased attention among the III-N compounds due to its potential for near-infrared optoelectronic devices or high efficiency solar cells [3].", 71 | "property_source_table": "", 72 | "value_source_passage": "Indium Nitride (InN), with a band gap of *0.69 eV [1, 2] has become the focus of increased attention among the III-N compounds due to its potential for near-infrared optoelectronic devices or high efficiency solar cells [3].", 73 | "value_source_table": "", 74 | "value_units": "eV" 75 | }, { 76 | "json_extracted_index": 1, 77 | "high_value": "1", 78 | "low_value": "1", 79 | "material": "GaN", 80 | "material_descriptor": "buffer layer", 81 | "material_source_passage": "InN quantum dots samples were grown by Metalorganic Vapor Phase Epitaxy (MOVPE) on GaN/sapphire substrates. A thick (*1 lm) buffer layer of GaN was grown on (0001) sapphire using the usual two-step process [8] at a temperature close to 1,000°C.", 82 | "material_source_table": "", 83 | "property_descriptor": "", 84 | "property_name": "thickness", 85 | "property_source_passage": "InN quantum dots samples were grown by Metalorganic Vapor Phase Epitaxy (MOVPE) on GaN/sapphire substrates. 
A thick (*1 lm) buffer layer of GaN was grown on (0001) sapphire using the usual two-step process [8] at a temperature close to 1,000°C.", 86 | "property_source_table": "", 87 | "value_source_passage": "InN quantum dots samples were grown by Metalorganic Vapor Phase Epitaxy (MOVPE) on GaN/sapphire substrates. A thick (*1 lm) buffer layer of GaN was grown on (0001) sapphire using the usual two-step process [8] at a temperature close to 1,000°C.", 88 | "value_source_table": "", 89 | "value_units": "µm" 90 | }, ... 91 | ] 92 | -------------------------------------------- 93 | 94 | By comparing the two jsons, the output will be output json: 95 | 96 | Output: 97 | { 98 | "json_extracted_index": 0, 99 | "compare": { 100 | "high_value": 1, 101 | "low_value": 1, 102 | "material": 1, 103 | "material_descriptor": 0, 104 | "material_source_passage": 0, 105 | "material_source_table": 0, 106 | "property_descriptor": 0, 107 | "property_name": 1, 108 | "property_source_passage": 1, 109 | "property_source_table": 0, 110 | "value_source_passage": 1, 111 | "value_source_table": 0, 112 | "value_units": 1 113 | } 114 | } 115 | 116 | 117 | With the given format, please output accordingly for the following jsons: 118 | -------------------------------------------- 119 | json_ground_truth: 120 | {{json_ground_truth}} 121 | -------------------------------------------- 122 | json_generated: 123 | {{json_extracted_list}} 124 | -------------------------------------------- 125 | 126 | Please only output your answer in the exact format as shown above without any prefix or suffix. 127 | Output: 128 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 
40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 
203 | -------------------------------------------------------------------------------- /colabs/Example_PDB_long_llama_inference.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[{"file_id":"1mbYaiiytf-c7e8tNpzcppHwxks0_SuLb","timestamp":1717484837203},{"file_id":"1yVNLy6YKtkJwk4kdwcUW30Hrzt1tBJuk","timestamp":1717356613138},{"file_id":"1_mWSaqR67znLRx3euZmMcom3_3nChoNm","timestamp":1717352161145},{"file_id":"1upwIkRZsBsSQ9I7mc11bk6ClawYvG7dV","timestamp":1717298974033},{"file_id":"175zL8DNlwGQIko4nArl_YkMXc7nqsvZ_","timestamp":1717211525893},{"file_id":"1FU5niwb_mlb5nlu0wDV3L3P49tHBE5EP","timestamp":1717205296074},{"file_id":"1uOLJANGUtzmXHymtfJszovrHXeE4Pt0d","timestamp":1717190289746},{"file_id":"https://github.com/CStanKonrad/long_llama/blob/main/long_llama_instruct_colab.ipynb","timestamp":1716452980202}],"gpuType":"A100","machine_shape":"hm","private_outputs":true},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","source":["# Task: PDB\n","## Modifying LongLLaMA: Focused Transformer Training for Context Scaling\n","\n","**Original Notebook**: https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_code_instruct_colab.ipynb\n","\n","\n","References:\n","* [LongLLaMA-Instruct-3Bv1.1](https://huggingface.co/syzymon/long_llama_3b_instruct)\n","* [FoT paper](https://arxiv.org/abs/2307.03170) and [GitHub repository](https://github.com/CStanKonrad/long_llama)"],"metadata":{"id":"blR8hhA_e-4S"}},{"cell_type":"markdown","source":["# Setup"],"metadata":{"id":"ZTROtgpafB_8"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"La3cPHfCe7hU"},"outputs":[],"source":["!pip install --upgrade pip\n","!pip install transformers==4.30.0 sentencepiece accelerate -q"]},{"cell_type":"code","source":["import numpy as np\n","import torch\n","from transformers import LlamaTokenizer, AutoModelForCausalLM, TextStreamer, PreTrainedModel, PreTrainedTokenizer\n","from typing import List, Optional\n","import os"],"metadata":{"id":"0x_utNYxfECA"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["os.listdir(os.getcwd())"],"metadata":{"id":"2A0YH8yf4K_f"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["MODEL_PATH = (\n"," \"syzymon/long_llama_3b_instruct\"\n",")\n","TOKENIZER_PATH = MODEL_PATH\n","# to fit into colab GPU we will use reduced precision\n","TORCH_DTYPE = torch.bfloat16\n","\n","if torch.cuda.is_available():\n"," device = torch.device(\"cuda\")\n","else:\n"," device = torch.device(\"cpu\")"],"metadata":{"id":"LrrFlXKMfHMs"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["device"],"metadata":{"id":"zVkJN0N96cPj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH)\n","\n","model = AutoModelForCausalLM.from_pretrained(\n"," MODEL_PATH,\n"," torch_dtype=TORCH_DTYPE,\n"," device_map=device,\n"," trust_remote_code=True,\n"," # mem_attention_grouping is used\n"," # to trade speed for memory usage\n"," # for details, see the section Additional configuration\n"," mem_attention_grouping=(1, 2048),\n",")\n","model.eval()"],"metadata":{"id":"5fE5Z1ABfJUp"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Load Input Documents (usually Papers)"],"metadata":{"id":"ElblcG7MfOM4"}},{"cell_type":"code","source":["import 
os"],"metadata":{"id":"kP6dd87I78Yq"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Specify directory here"],"metadata":{"id":"SuZ0JisLWGs6"}},{"cell_type":"code","source":["from google.colab import drive\n","drive.mount('/content/drive')"],"metadata":{"id":"NyPVAae8USwJ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Specify paths containing the input JSONS and where the results should be saved."],"metadata":{"id":"v8XfPgoaXCSX"}},{"cell_type":"code","source":["# @title Input and Result Output Dirs\n","INPUT_RTDIR = '/content/drive/My Drive/local_benchmark/' # @param {type:\"string\"}\n","DIRPATH = '/content/drive/My Drive/local_benchmark/results/' # @param {type:\"string\"}"],"metadata":{"cellView":"form","id":"Yt4-kncbWlDU"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["os.listdir(INPUT_RTDIR)"],"metadata":{"id":"63IT20cqWvUG"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["#DIRPATH = 'inference/t_1/'"],"metadata":{"id":"VBeWnZwG5ZsB"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# @title Task Specific Config\n","TASK_NAME = \"pdb\" # @param {type:\"string\"}\n","PROMPT_NAME = \"reconstruct_protein_amino_acid_sequence_0_shot\" # @param {type:\"string\"}\n","PROMPT_PATH = PROMPT_NAME + \".txt\""],"metadata":{"id":"SNezBvLsPHh1"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["INPUT_DIR= f'{INPUT_RTDIR}/{TASK_NAME}/inputs/'"],"metadata":{"id":"8zwTeM3Ajm3V"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Modifications for Paper, Prompt order:\n","* below are the tertiary structure -> The PROTEIN TERTIARY STRUCTURE is provided above.\n","* Append \"PROTEIN TERTIARY STRUCTURE\" to input['text']\n","* AMINO ACID SEQUENCE:"],"metadata":{"id":"PE6Bqaetd2vj"}},{"cell_type":"code","source":["PMODIFIED = \"\"\"You are a computational biologist and I want you to reconstruct a protein's amino acid sequence from its tertiary structure.\n","* The input is a PDB that is a textual format describing the three-dimensional structures of a protein.\n","* Return the amino acid sequence in the standard FASTA format, which starts with a definition line with the greater than (>) line,\n"," followed by the single-letter codes for all amino acids in the second line.\n","* Make sure the amino acid sequence is in the second line.\n","* If there is an unknown amino acid in the structure, put \"X\" in the sequence.\n","* Make sure you go through the whole structure and get all the amino acids.\n","* No extra explanation is needed.\n","\n","The PROTEIN TERTIARY STRUCTURE is provided above.\n","\n","AMINO ACID SEQUENCE:\n","\"\"\""],"metadata":{"id":"k0K5cMNCd2vj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["PREFIX = \"PROTEIN TERTIARY STRUCTURE: \""],"metadata":{"id":"oY5CI6E0Es-p"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["import json\n","import os"],"metadata":{"id":"1nb9zULE8lE2"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["os.listdir(INPUT_DIR), len(os.listdir(INPUT_DIR))"],"metadata":{"id":"_pqlrcjo2JFn"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["def get_paper_list(inputdir):\n"," files = os.listdir(inputdir)\n"," papers = []\n"," for f in files:\n"," if f.endswith('.json'):\n"," papers.append(f[:f.rindex(\".json\")])\n"," return papers"],"metadata":{"id":"HssNqrtXcohP"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# modified 
output dict metadata prep\n","def prepare_task_for_paper(paper: str, prompt_path: str, lm_id: str)-> dict[str, str]:\n"," paper_input = f'{INPUT_DIR}/{paper}.json'\n"," inputs = json.load(open(paper_input, 'r'))\n","\n"," return {'record_id': inputs['record_id'], 'model_id': lm_id, 'prompt_path': prompt_path,\n"," 'prompt_text': PREFIX + inputs['text'] + PMODIFIED , 'response_text': ''}"],"metadata":{"id":"FiRm3uas3cp1"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Run on all sequences"],"metadata":{"id":"chPK2r8CfRcC"}},{"cell_type":"code","source":["from io import StringIO\n","import sys"],"metadata":{"id":"I1fDgI7d8Ai3"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["import os\n","#for the paper\n","@torch.no_grad()\n","def load_to_memory(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, text: str):\n"," tokenized_data = tokenizer(text, return_tensors=\"pt\")\n"," input_ids = tokenized_data.input_ids\n"," input_ids = input_ids.to(model.device)\n"," # torch.manual_seed(0)\n"," output = model(input_ids=input_ids)\n"," memory = output.past_key_values\n"," return memory"],"metadata":{"id":"rt0PMfKDfLjy"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["@torch.no_grad()\n","def generate_with_memory_new(\n"," model: PreTrainedModel, tokenizer: PreTrainedTokenizer, memory, prompt: str, temperature=1.0\n","):\n"," tokenized_data = tokenizer(prompt, return_tensors=\"pt\")\n"," input_ids = tokenized_data.input_ids\n"," input_ids = input_ids.to(model.device)\n","\n"," streamer = TextStreamer(tokenizer, skip_prompt=False)\n","\n"," new_memory = memory\n","\n"," catch_stout = StringIO()\n"," sys.stdout = catch_stout\n","\n"," stop = False\n"," while not stop:\n"," output = model(input_ids, past_key_values=new_memory)\n"," new_memory = output.past_key_values\n"," assert len(output.logits.shape) == 3\n"," assert output.logits.shape[0] == 1\n"," last_logit = output.logits[[0], [-1], :]\n"," dist = torch.distributions.Categorical(logits=last_logit / temperature)\n"," next_token = dist.sample()\n"," if next_token[0] == tokenizer.eos_token_id:\n"," streamer.put(next_token[None, :])\n"," streamer.end()\n"," stop = True\n"," # Restore stdout to its original state\n"," sys.stdout = sys.__stdout__\n"," else:\n"," input_ids = next_token[None, :]\n"," streamer.put(input_ids)\n"," return catch_stout.getvalue()"],"metadata":{"id":"5XUhGHsRz3gr"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["import inspect"],"metadata":{"id":"GzQ6C80W1120"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["PROMPT_PATH"],"metadata":{"id":"e1s0cYDQ6Dd1"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["def run_eval_loop(paper_list: List[str], results_dir: str, temperature: float):\n"," for PAPER in paper_list:\n"," print(PAPER)\n"," outpath = f'{results_dir}/{PAPER}.json'\n"," if os.path.exists(outpath):\n"," print(f'Skipping since result for {PAPER} already exists.')\n"," else:\n"," inputs = json.load(open(f'{INPUT_DIR}/{PAPER}.json', 'r'))\n"," out_dict = prepare_task_for_paper(paper=PAPER, prompt_path=PROMPT_PATH, lm_id=MODEL_PATH)\n","\n"," fot_memory = load_to_memory(model, tokenizer, PREFIX + inputs['text']) # loads the paper to memory\n"," answer = generate_with_memory_new(model, tokenizer, fot_memory, PMODIFIED, temperature) #asks the prompt after\n"," out_dict['response_text'] = answer\n"," json.dump(out_dict, open(outpath, 'w'))\n"," 
return"],"metadata":{"id":"ZMzjo3EycvHG"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["d: Which run (trial) this is. If you're running multiple trials of the same experiment."],"metadata":{"id":"8N1A5dRMOIq-"}},{"cell_type":"code","source":["# @title Specify Run_d here\n","trial = \"run_0\" # @param {type:\"string\"}\n","EXP_DIR = f\"{DIRPATH}/{TASK_NAME}/{PROMPT_NAME}/longllama/{trial}/success/\""],"metadata":{"id":"naUs15IAe1I1"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["print(EXP_DIR)"],"metadata":{"id":"rq5XoUUjGl9C"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["os.makedirs(EXP_DIR, exist_ok=True)"],"metadata":{"id":"477c0EetJo7V"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["PAPERS = get_paper_list(INPUT_DIR)\n","print(len(PAPERS))"],"metadata":{"id":"v2Z4kTuMAuJA"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Now run on all papers"],"metadata":{"id":"87KhE0wcD8mD"}},{"cell_type":"code","source":["print(EXP_DIR)"],"metadata":{"id":"WZqxTT4RHeuj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["run_eval_loop(PAPERS, EXP_DIR, 1.0)"],"metadata":{"id":"ag2_yMVFHmlp"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Aside: Handling failures if any"],"metadata":{"id":"XSIB_1PYSIUv"}},{"cell_type":"code","source":["PAPERS_FAILED = ['18', '19', '20', '7', '14', '5', '21']"],"metadata":{"id":"dVCv4KCA2_m6"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["PAPERS_SUCCESS = PAPERS.copy()\n","for p in PAPERS_FAILED:\n"," PAPERS_SUCCESS.remove(p)"],"metadata":{"id":"7BJPL0Xj3P_9"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["len(PAPERS_SUCCESS)"],"metadata":{"id":"PRnpiRcQ3WNI"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["torch.cuda.empty_cache()"],"metadata":{"id":"BNOg3JBnOwTw"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["run_eval_loop(PAPERS_SUCCESS, EXP_DIR, 1.0)"],"metadata":{"id":"VJ4lSS3wLBtR"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["PAPERS_SUCCESS"],"metadata":{"id":"GGOaUkMRM0_F"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["os.listdir(EXP_DIR)"],"metadata":{"id":"krFzGVOItsSv"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Render Outputs"],"metadata":{"id":"Z3laZTRma8Y2"}},{"cell_type":"code","source":["test_paper = PAPERS[2]"],"metadata":{"id":"yRZjD7VSC7or"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["sd0 = json.load(open(f'{EXP_DIR}/{test_paper}.json', 'r'))"],"metadata":{"id":"wY69Uow0DnfH"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["sd0['response_text']"],"metadata":{"id":"DmEj1MD7D1vj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"UnEzLqukVUm2"},"execution_count":null,"outputs":[]}]} -------------------------------------------------------------------------------- /colabs/CURIEbenchmark_inference_Command_R_Plus.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[]},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["## Notebook for processesing CURIE-benchmark tasks using the **Cohere Command-R Plus 
model**.\n"],"metadata":{"id":"fGW0Oeb2IxMX"}},{"cell_type":"code","source":["# @title Import Required Libraries\n","import os\n","import json\n","import pandas as pd\n","import numpy as np\n","import altair as alt\n","import logging\n","import textwrap as tr\n","import torch\n","from google.colab import drive\n","from tenacity import retry, stop_after_attempt, wait_exponential\n","import time\n","from dataclasses import dataclass\n","from typing import Optional, Dict, List\n","from enum import Enum"],"metadata":{"id":"8rsiVfsTG-A-"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# @title Install and import Cohere\n","! pip install -U cohere\n","import cohere"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"aYK7SbWtG-vu","outputId":"bbef8713-ef2b-49c0-8590-c24cc8ca5216"},"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Collecting cohere\n"," Downloading cohere-5.13.4-py3-none-any.whl.metadata (3.4 kB)\n","Collecting fastavro<2.0.0,>=1.9.4 (from cohere)\n"," Downloading fastavro-1.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)\n","Requirement already satisfied: httpx>=0.21.2 in /usr/local/lib/python3.10/dist-packages (from cohere) (0.28.1)\n","Collecting httpx-sse==0.4.0 (from cohere)\n"," Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)\n","Collecting parameterized<0.10.0,>=0.9.0 (from cohere)\n"," Downloading parameterized-0.9.0-py2.py3-none-any.whl.metadata (18 kB)\n","Requirement already satisfied: pydantic>=1.9.2 in /usr/local/lib/python3.10/dist-packages (from cohere) (2.10.3)\n","Requirement already satisfied: pydantic-core<3.0.0,>=2.18.2 in /usr/local/lib/python3.10/dist-packages (from cohere) (2.27.1)\n","Requirement already satisfied: requests<3.0.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from cohere) (2.32.3)\n","Requirement already satisfied: tokenizers<1,>=0.15 in /usr/local/lib/python3.10/dist-packages (from cohere) (0.21.0)\n","Collecting types-requests<3.0.0,>=2.0.0 (from cohere)\n"," Downloading types_requests-2.32.0.20241016-py3-none-any.whl.metadata (1.9 kB)\n","Requirement already satisfied: typing_extensions>=4.0.0 in /usr/local/lib/python3.10/dist-packages (from cohere) (4.12.2)\n","Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx>=0.21.2->cohere) (3.7.1)\n","Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx>=0.21.2->cohere) (2024.12.14)\n","Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.21.2->cohere) (1.0.7)\n","Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx>=0.21.2->cohere) (3.10)\n","Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx>=0.21.2->cohere) (0.14.0)\n","Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.9.2->cohere) (0.7.0)\n","Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.0.0->cohere) (3.4.0)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.0.0->cohere) (2.2.3)\n","Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from tokenizers<1,>=0.15->cohere) (0.27.0)\n","Requirement already satisfied: filelock in 
/usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers<1,>=0.15->cohere) (3.16.1)\n","Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers<1,>=0.15->cohere) (2024.10.0)\n","Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers<1,>=0.15->cohere) (24.2)\n","Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers<1,>=0.15->cohere) (6.0.2)\n","Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.16.4->tokenizers<1,>=0.15->cohere) (4.67.1)\n","Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.10/dist-packages (from anyio->httpx>=0.21.2->cohere) (1.3.1)\n","Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx>=0.21.2->cohere) (1.2.2)\n","Downloading cohere-5.13.4-py3-none-any.whl (250 kB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m250.0/250.0 kB\u001b[0m \u001b[31m11.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hDownloading httpx_sse-0.4.0-py3-none-any.whl (7.8 kB)\n","Downloading fastavro-1.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.1/3.1 MB\u001b[0m \u001b[31m75.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hDownloading parameterized-0.9.0-py2.py3-none-any.whl (20 kB)\n","Downloading types_requests-2.32.0.20241016-py3-none-any.whl (15 kB)\n","Installing collected packages: types-requests, parameterized, httpx-sse, fastavro, cohere\n","Successfully installed cohere-5.13.4 fastavro-1.10.0 httpx-sse-0.4.0 parameterized-0.9.0 types-requests-2.32.0.20241016\n"]}]},{"cell_type":"code","source":["# @title API Configuration\n","API_KEY = \"YOUR_API_KEY\"\n","MODEL_PATH = 'command-r-plus'\n","co_v2 = cohere.ClientV2(api_key=API_KEY)"],"metadata":{"id":"lw-KLn5zHFmx"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# @title Mount Google Drive\n","drive.mount('/content/drive', force_remount=True)\n","os.chdir(\"/content/drive/My Drive\")"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Ev-eMAHRHJl3","outputId":"4d0f2a09-c59f-4868-e642-b01d5819a386"},"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Mounted at /content/drive\n"]}]},{"cell_type":"code","source":["# @title Configuration Classes\n","@dataclass\n","class ExperimentConfig:\n"," \"\"\"Configuration class for experiment settings\"\"\"\n"," name: str\n"," base_dir: str\n"," inference_dir: str\n"," prompt_path: str\n","\n","class ExperimentType(Enum):\n"," \"\"\"Enum for different types of experiments\"\"\"\n"," PDB = \"pdb\"\n"," MPVE = \"mpve\"\n"," HFE = \"hfe\"\n"," GEO = \"geo\"\n"," DFT = \"dft\"\n"," HFD = \"hfd\"\n"," QECC_PDF = \"qecc_pdf\"\n"," QECC_TEX = \"qecc_tex\""],"metadata":{"id":"jN6DcI1BIP-u"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"0K6ocK19G5GM"},"outputs":[],"source":["# @title Experiment Manager Class\n","class ExperimentManager:\n"," \"\"\"Manages different experiment configurations\"\"\"\n"," def __init__(self, base_path: str = \"/content/drive/My Drive\"):\n"," self.base_path = base_path\n"," 
self.experiments = self._initialize_experiments()\n","\n"," def _initialize_experiments(self) -> Dict[ExperimentType, ExperimentConfig]:\n"," \"\"\"Initialize all experiment configurations\"\"\"\n"," benchmark_path = f\"{self.base_path}/benchmarks\"\n"," return {\n"," ExperimentType.PDB: ExperimentConfig(\n"," name=\"PDB\",\n"," base_dir=f\"{self.base_path}/pdb\",\n"," inference_dir=f\"{self.base_path}/inference/multi_runs/current/pdb_new/reconstruct_protein_amino_acid_sequence_0_shot/\",\n"," prompt_path=f\"{benchmark_path}/prompts/reconstruct_protein_amino_acid_sequence_0_shot.txt\"\n"," ),\n"," ExperimentType.MPVE: ExperimentConfig(\n"," name=\"MPVE\",\n"," base_dir=f\"{benchmark_path}/data/mpve\",\n"," inference_dir=f\"{benchmark_path}/inference/multi_runs/current/mpve/mat_paper_to_property_1_shot_exclude_trivia/\",\n"," prompt_path=f\"{benchmark_path}/prompts/mat_paper_to_property_1_shot_exclude_trivia.txt\"\n"," ),\n"," ExperimentType.HFE: ExperimentConfig(\n"," name=\"HFE\",\n"," base_dir=f\"{benchmark_path}/data/hfe\",\n"," inference_dir=f\"{benchmark_path}/inference/multi_runs/current/hfe/extract_hamiltonian_0_shot/\",\n"," prompt_path=f\"{benchmark_path}/prompts/extract_hamiltonian_0_shot.txt\"\n"," ),\n"," ExperimentType.GEO: ExperimentConfig(\n"," name=\"GEO\",\n"," base_dir=f\"{benchmark_path}/data/geo\",\n"," inference_dir=f\"{benchmark_path}/inference/multi_runs/current/geo/extract_dataset_from_geo_papers_0_shot\",\n"," prompt_path=f\"{benchmark_path}/prompts/extract_dataset_from_geo_papers_0_shot.txt\"\n"," ),\n"," ExperimentType.DFT: ExperimentConfig(\n"," name=\"DFT\",\n"," base_dir=f\"{benchmark_path}/data/dft\",\n"," inference_dir=f\"{benchmark_path}/inference/multi_runs/current/dft/extract_dft_metadata_1_shot/\",\n"," prompt_path=f\"{benchmark_path}/prompts/extract_dft_metadata_1_shot.txt\"\n"," ),\n"," ExperimentType.HFD: ExperimentConfig(\n"," name=\"HFD\",\n"," base_dir=f\"{benchmark_path}/data/hfd\",\n"," inference_dir=f\"{benchmark_path}/inference/multi_runs/current/hfd/derivation_prompt/\",\n"," prompt_path=f\"{benchmark_path}/prompts/derivation_prompt.txt\"\n"," ),\n"," ExperimentType.QECC_PDF: ExperimentConfig(\n"," name=\"QECC_PDF\",\n"," base_dir=f\"{benchmark_path}/data/qecc_pdf\",\n"," inference_dir=f\"{benchmark_path}/inference/multi_runs/current/qecc_pdf/describe_code_in_paper/\",\n"," prompt_path=f\"{benchmark_path}/prompts/describe_code_in_paper.txt\"\n"," ),\n"," ExperimentType.QECC_TEX: ExperimentConfig(\n"," name=\"QECC_TEX\",\n"," base_dir=f\"{benchmark_path}/data/qecc_tex\",\n"," inference_dir=f\"{benchmark_path}/inference/multi_runs/current/qecc_tex/describe_code_in_paper/\",\n"," prompt_path=f\"{benchmark_path}/prompts/describe_code_in_paper.txt\"\n"," )\n"," }\n","\n"," def get_config(self, experiment_type: ExperimentType) -> ExperimentConfig:\n"," \"\"\"Get configuration for specific experiment type\"\"\"\n"," return self.experiments[experiment_type]"]},{"cell_type":"code","source":["# @title Paper Processing Utilities\n","def specialize_prompt(template: str, tag: str, infil: str) -> str:\n"," \"\"\"Replace a tag in a template with provided text.\"\"\"\n"," if tag in template:\n"," return template.replace(tag, infil)\n"," raise ValueError(f'{tag} absent in template.')\n","\n","def prepare_task_for_paper(paper: str, config: ExperimentConfig, model_id: str) -> dict:\n"," \"\"\"Prepare the task information for a given paper.\"\"\"\n"," paper_input = os.path.join(config.base_dir, 'inputs', f'{paper}.json')\n"," paper_gt = 
os.path.join(config.base_dir, 'ground_truth', f'{paper}.json')\n","\n"," with open(paper_input, 'r') as f:\n"," inputs = json.load(f)\n"," with open(paper_gt, 'r') as f:\n"," targets = json.load(f)\n","\n"," with open(config.prompt_path, 'r') as f:\n"," ptemp = f.read()\n","\n"," spec_prompt = specialize_prompt(ptemp, '{{text}}', infil=inputs['text'])\n","\n"," return {\n"," 'record_id': paper,\n"," 'model_id': model_id,\n"," 'prompt_path': config.prompt_path,\n"," 'prompt_text': spec_prompt,\n"," 'response_text': ''\n"," }"],"metadata":{"id":"02DSqrzbHYUi"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# @title Paper Processor Class\n","class PaperProcessor:\n"," \"\"\"Handles the processing of scientific papers\"\"\"\n","\n"," def __init__(self, api_key: str, model_path: str):\n"," self.co_v2 = cohere.ClientV2(api_key=api_key)\n"," self.model_path = model_path\n"," self._setup_logging()\n","\n"," def _setup_logging(self):\n"," \"\"\"Configure logging settings\"\"\"\n"," logging.basicConfig(\n"," filename='experiment_log.log',\n"," level=logging.INFO,\n"," format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'\n"," )\n"," self.logger = logging.getLogger(__name__)\n","\n"," @retry(\n"," stop=stop_after_attempt(3),\n"," wait=wait_exponential(multiplier=1, min=4, max=10),\n"," reraise=True\n"," )\n"," def _make_api_call(self, messages: List[Dict]) -> str:\n"," \"\"\"Make API call with retry logic\"\"\"\n"," response = self.co_v2.chat(\n"," model=self.model_path,\n"," messages=messages,\n"," temperature=0.9,\n"," k=50,\n"," p=0.95,\n"," max_tokens=4000\n"," )\n"," return self._extract_response_text(response)\n","\n"," def _extract_response_text(self, response) -> str:\n"," \"\"\"Extract text from API response\"\"\"\n"," if hasattr(response, 'message'):\n"," if hasattr(response.message, 'content'):\n"," if isinstance(response.message.content, list):\n"," return ' '.join(item.text for item in response.message.content if hasattr(item, 'text'))\n"," elif isinstance(response.message.content, str):\n"," return response.message.content\n"," return str(response)\n","\n"," def _save_result(self, task_info: dict, inference_dir: str, run_id: int, success: bool = True):\n"," \"\"\"Save processing results\"\"\"\n"," status = 'success' if success else 'failure'\n"," output_dir = os.path.join(inference_dir, self.model_path, f'run_{run_id}', status)\n"," os.makedirs(output_dir, exist_ok=True)\n","\n"," serializable_task_info = {\n"," 'record_id': task_info['record_id'],\n"," 'model_id': task_info['model_id'],\n"," 'prompt_path': task_info['prompt_path'],\n"," 'prompt_text': task_info['prompt_text'],\n"," 'response_text': str(task_info['response_text'])\n"," }\n","\n"," output_file = os.path.join(output_dir, f'{task_info[\"record_id\"]}.json')\n"," with open(output_file, 'w') as f:\n"," json.dump(serializable_task_info, f, indent=4)\n","\n"," def process_papers(self, config: ExperimentConfig, run_range: range = range(1, 3)):\n"," \"\"\"Process papers for given experiment configuration\"\"\"\n"," input_dir = os.path.join(config.base_dir, 'inputs')\n"," papers = [f.replace('.json', '') for f in os.listdir(input_dir) if f.endswith('.json')]\n","\n"," self.logger.info(f\"Starting processing {len(papers)} papers for {config.name}\")\n","\n"," for run_id in run_range:\n"," self.logger.info(f\"Starting run {run_id + 1}\")\n"," for i, paper in enumerate(papers, 1):\n"," self.logger.info(f\"Processing paper {i}/{len(papers)} in run {run_id + 1}\")\n"," self._process_single_paper(paper, 
config, run_id)\n","\n"," def _process_single_paper(self, paper: str, config: ExperimentConfig, run_id: int):\n"," \"\"\"Process a single paper\"\"\"\n"," try:\n"," task_info = prepare_task_for_paper(paper, config, self.model_path)\n","\n"," if len(task_info['prompt_text'].split()) > 128000:\n"," raise ValueError(\"Input text exceeds token limit\")\n","\n"," response = self._make_api_call([{\n"," \"role\": \"user\",\n"," \"content\": task_info['prompt_text']\n"," }])\n","\n"," task_info['response_text'] = response\n"," self._save_result(task_info, config.inference_dir, run_id, success=True)\n"," time.sleep(2) # Rate limiting\n","\n"," except Exception as e:\n"," self.logger.error(f\"Error processing paper {paper}: {str(e)}\")\n"," task_info['response_text'] = str(e)\n"," self._save_result(task_info, config.inference_dir, run_id, success=False)\n"," time.sleep(2)"],"metadata":{"id":"vslgWfVSHQR0"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# @title Main Execution\n","def main():\n"," \"\"\"Main execution function\"\"\"\n"," experiment_manager = ExperimentManager()\n","\n"," processor = PaperProcessor(\n"," api_key=API_KEY,\n"," model_path=MODEL_PATH\n"," )\n","\n"," # Select experiment type\n"," experiment_type = ExperimentType.DFT # CHANGE THIS to process different experiments\n"," config = experiment_manager.get_config(experiment_type)\n","\n"," processor.process_papers(config)\n","\n","if __name__ == \"__main__\":\n"," main()"],"metadata":{"collapsed":true,"id":"GVSZxHWAHSRm"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["For example, for running the DFT task, you need to run the following cell:"],"metadata":{"id":"p9cYh-ffKGR5"}},{"cell_type":"code","source":["experiment_manager = ExperimentManager()\n","\n","processor = PaperProcessor(\n"," api_key=\"YOUR_API_KEY\",\n"," model_path=MODEL_PATH\n",")\n","\n","experiment_type = ExperimentType.DFT\n","config = experiment_manager.get_config(experiment_type)\n","\n","print(\"Selected Configuration:\")\n","print(f\"Base Directory: {config.base_dir}\")\n","print(f\"Inference Directory: {config.inference_dir}\")\n","print(f\"Prompt Path: {config.prompt_path}\")\n","\n","processor.process_papers(config)"],"metadata":{"id":"1V5ipuLUKQ6n"},"execution_count":null,"outputs":[]}]} -------------------------------------------------------------------------------- /prompts/mat_paper_to_property_1_shot.txt: -------------------------------------------------------------------------------- 1 | You are a materials scientist. Your goal is to find and extract all numeric values or numeric value ranges of material properties mentioned or tabulated in a given paper text. 2 | 3 | The output should be in json format: 4 | [ 5 | { 6 | "material": "", 7 | "material_descriptor": "", 8 | "material_source_passage": "", 9 | "material_source_table": "", 10 | "property_name": "", 11 | "property_descriptor": "", 12 | "property_source_passage": "", 13 | "property_source_table": "", 14 | "low_value" : "", 15 | "high_value": "", 16 | "value_units": "", 17 | "value_source_passage": "", 18 | "value_source_table": "", 19 | }, ... 20 | ] 21 | 22 | There are some certain rules you need to follow: 23 | 1. Be thorough! Don't miss even a single numeric value. 24 | 2. Avoid null low_value and high_value. Simply skip the entry if you cannot extract the numeric value. 25 | 3. If a value is only extractable from a figure, not text or table, simply skip that entry. 26 | 4. 
Make sure whatever low_value and high_value you extract actually exist in the source passage or referred table. 27 | 5. If one passage contains multiple materials or properties, record all of them as separate entries. 28 | 6. If different values are mentioned for the same material property, record all of them as separate entries. 29 | 7. Material, property, and value may or may not be mentioned in the same passage. 30 | 31 | An example of the output is as follows: 32 | Example text: 33 | Correlation of Raman and X-ray diffraction measurements of annealed pulsed laser deposited ZnO thin films 34 | 35 | abstract 36 | Abstract Raman spectroscopy, X-ray diffractometry and atomic force microscopy have been used to characterise ZnO thin films grown by pulsed laser deposition as a function of the post-growth annealing temperature. The results show substantial enhancement and broadening of certain Raman features which correlate excellently with the change in width of the X-ray diffraction peaks. The 570 cm −1 Raman feature showed pronounced asymmetry and enhanced intensity in the unannealed sample. An increase in grain size observed after subsequent annealing produced a substantial reduction in both the asymmetry and intensity of this peak. Our experimental data suggest that electric fields, due to charge trapping at grain boundaries, in conjunction with localised and surface phonon modes are the cause of the intensity enhancement and asymmetry of this feature. 37 | 38 | Introduction 39 | Recent years have seen a renewed interest in the II-VI semiconductor ZnO (direct bandgap ~3.3 eV). ZnO crystallises in the wurzite lattice, is optically transparent with a large exciton binding energy of 60 meV, and is capable of UV emission and lasing. ZnO is normally n-type, but p-type ZnO has been grown which has contributed significantly to the fabrication of UV / blue LED's and laser diodes. Recently ZnO nanowires have shown UV lasing under optical pumping, and fabrication of a homostructural ZnO p-n junction has been reported [see e.g. ref. [1] [2] [3] . 40 | In this study Raman Spectroscopy and x-ray diffraction (XRD) were employed to characterise thin films grown using pulsed laser deposition (PLD) and annealed in situ at various temperatures. Our data allow us to firmly identify the mechanism responsible for enhancement of the 570 cm -1 Raman feature observed in polycrystalline samples. 41 | 42 | Experimental Details 43 | ZnO films were grown on (0001) sapphire substrates by PLD using a 10 Hz pulsed KrF excimer laser (λ=248 nm). The fluence on target was set at 1.7 J/cm 2 for all samples. A ZnO ceramic target (99.99%) was used throughout. The target to substrate distance was ~ 4 cm. The thin films were grown in an O 2 (purity 99.99%) pressure of 0.3 mbar and the substrate temperature was maintained at 400°C during growth. Typically the films were 150-200 nm, giving a deposition rate of 0.025nm/pulse. The films were subsequently annealed in O 2 (0.3 mbar) between 400°C and 600°C (see Table 1 ) in the growth chamber immediately after deposition. SEM data show that the films are continuous and show no evidence of porosity. The crystal structure and quality of the samples were investigated by XRD in the θ-2θ mode (Siemens D500 using Cu K α radiation). Raman scattering measurements were performed using a micro-Raman spectrometer equipped with a CCD detector.
Raman spectra were excited with 1.96 eV photons from a He-Ne laser (λ~632 nm) or, for the resonance excitation measurements, with 3.82 eV photons from a UV laser (λ=325 nm), in both cases using back scattering geometry. The He-Ne laser beam of 5-6 mW was focussed on the sample surface to a spot of diameter ~ 10 μm. 44 | Atomic force microscopy (AFM) measurements were made using a commercial AFM in contact mode operation. identical within the limits of our experimental accuracy (± 0.001 nm). The average grain size (parallel to the (0002) direction) of the ZnO films can be estimated from the full width at half maximum (FWHM) of the (0002) peak using Scherrer's relation [5] (see Table 1 ). The range of annealing temperatures used in our work coincides with those known to lead to major grain growth [6] [7] . The annealed samples show a decrease of (0002) peak FWHM and consequent increase in grain size with increasing annealing temperature. These results are supported by AFM imaging of the samples (figure 2), where measurements of the lateral grain sizes are 135 ± 40 nm for the sample annealed at 400 0 C and 180 ± 50 nm for the sample annealed at 500 0 C. The annealing process clearly produces a recovery of the crystal structure and increase of the grain size. We note that in our experimental conditions, the increase in grain size is achieved after annealing for relatively short times. Such an effect was also observed in the case of magnetron sputtered ZnO [8] . 45 | The unpolarized Raman spectra (with He-Ne excitation) of the ZnO thin films are shown in figure 3 as a function of annealing temperature. The peaks at 418 cm -1 and 751 cm -1 are due to scattering from the sapphire substrate. The E 2 437cm -1 peak, characteristic of the wurtzite lattice, can be seen in all samples. The A 1 (TO) mode at 379 cm -1 is apparent in samples c and d. However, the Raman spectra are dominated by the longitudinal optical vibration at ~570cm -1 . This band is attributed to the E 1 (LO) mode [9] . The intensity and FWHM of the 570 cm -1 peak appears anomalously high for the unannealed sample and the sample (b) which was annealed at 400 0 C, but it is reduced dramatically (by a factor of ~100 relative to the sapphire features) with increasing annealing temperature. In addition, this feature shows a marked asymmetry which also decreases with increasing annealing temperature. Raman spectra were also measured with UV excitation λ=325 nm (or 3.82eV) and are shown in figure 4 . The spectra from all the samples are similar, that is the strong band at 570 cm -1 was observed below the 800 cm -1 spectral region whereas a 2LO vibrational mode at ~1150 cm -1 was recorded in the higher spectral range as reported by several authors [10] [11] . The Raman peak at 570 cm -1 is symmetric for all the PLD-grown ZnO samples when the excitation is at or above resonance. 46 | 47 | Discussion 48 | We observe a clear correlation between the XRD data and the corresponding nonresonant Raman spectra as a function of annealing temperature. As the average grain size increases, we observe a reduction in the relative intensity and asymmetry of the 570 cm -1 band in the Raman spectra. The intensity and asymmetry of the LO mode at 570cm -1 has been widely discussed in the literature. 
Explanations for this structure include resonance enhancement due to impurity levels in the band gap [12] , contributions from both the A 1 (LO) and E 1 (LO) modes due to random crystallite orientation [12] or a combination of electric field induced (EFI) Raman enhancement and one of the following mechanisms (i) coupled phonon-plasmon scattering or (ii) localised interface and/or surface phonon modes [13] . 49 | The absence of the characteristic 2 nd , 3 rd order scattering in our spectra rules out the resonant enhancement by levels in the gap. These are observed in the spectra of the samples under UV excitation as expected. Contributions from both the A 1 (LO) and E 1 (LO) modes due to random crystallite orientation can also be ruled out for the reasons given by Exarhos and Sharma [12] , namely that, after annealing the XRD results remain essentially the same, except for a reduction in the FWHM of the peaks, with no evidence for substantial crystallite reorientation. In addition this cannot explain the abnormally high intensity of the 570 cm -1 band in samples (a) and (b) in particular. Phonon-plasmon coupling is known to produce significant line shifts and broadening in Raman spectra. 50 | However, for highly polycrystalline material, the contribution of the lower frequency L -mode is generally absent, and a shift and broadening to higher energy of the L + mode is observed. This behaviour is not seen in our samples, hence we believe that the line broadening we observe is unlikely to be due to phonon-plasmon coupling. 51 | The most consistent explanation of the behaviour of our Raman data is in terms of EFI enhancement (via charge trapping at grain boundaries) of the 570cm -1 feature [13] . This enhancement effect in conjunction with the presence of localised/surface phonon modes, which arise due to the small grain size, accounts for both the intensity and asymmetry of the peak in the unannealed sample [13] . Surface phonon modes have been reported at ~550cm -1 in ZnO [14] , which is close to the low energy side of the broad LO mode. In the annealed samples, the increase in the grain size reduces the grain boundary density and hence the effects of charge trapping at grain boundaries. Consequently the EFI enhancement is substantially reduced, in addition to elimination of the surface/interface modes. The importance of band bending at grain boundaries in polycrystalline samples and the presence of strong electric fields has also been found in studies of varistor action in ZnO [15] and of the green luminescence mechanism [16] . Similar enhancement and asymmetry effects have also been observed in the Raman scattering of ZnO containing gold colloids [17] . These were attributed to a surface enhancement effect caused by anomalously large electric fields due to the colloid plasmon resonance. Thus, our data appear to support the model proposed in ref. [13] . This suggests that the enhancement and sample. An increase in the grain size and reduction in the electric field intensity due to annealing lead to a decrease in the intensity and asymmetry of the LO mode. 52 | Tables: Table 1 : 53 | Growth conditions and results of x-ray analysis for our PLD-grown ZnO thin films. 54 | 55 | Sample 56 | Annealing temp. Raman scattering intensity (arb. 
units) 57 | Raman shift (cm ) 58 | 59 | 60 | Example output: '[{'material': 'ZnO', 'material_descriptor': 'II-VI semiconductor\nwurzite lattice', 'material_source_passage': 'Recent years have seen a renewed interest in the II-VI semiconductor ZnO (direct bandgap ~3.3 eV). ZnO crystallises in the wurzite lattice, is optically transparent with a large exciton binding energy of 60 meV, and is capable of UV emission and lasing.', 'material_source_table': ' ', 'property_name': 'direct bandgap', 'property_descriptor': ' ', 'property_source_passage': 'Recent years have seen a renewed interest in the II-VI semiconductor ZnO (direct bandgap ~3.3 eV). ZnO crystallises in the wurzite lattice, is optically transparent with a large exciton binding energy of 60 meV, and is capable of UV emission and lasing.', 'property_source_table': ' ', 'low_value': '3.3', 'high_value': '3.3', 'value_units': 'eV', 'value_source_passage': 'Recent years have seen a renewed interest in the II-VI semiconductor ZnO (direct bandgap ~3.3 eV). ZnO crystallises in the wurzite lattice, is optically transparent with a large exciton binding energy of 60 meV, and is capable of UV emission and lasing.', 'value_source_table': ' '}, {'material': 'ZnO', 'material_descriptor': 'II-VI semiconductor\nwurzite lattice', 'material_source_passage': 'Recent years have seen a renewed interest in the II-VI semiconductor ZnO (direct bandgap ~3.3 eV). ZnO crystallises in the wurzite lattice, is optically transparent with a large exciton binding energy of 60 meV, and is capable of UV emission and lasing.', 'material_source_table': ' ', 'property_name': 'exciton binding energy', 'property_descriptor': ' ', 'property_source_passage': 'Recent years have seen a renewed interest in the II-VI semiconductor ZnO (direct bandgap ~3.3 eV). ZnO crystallises in the wurzite lattice, is optically transparent with a large exciton binding energy of 60 meV, and is capable of UV emission and lasing.', 'property_source_table': ' ', 'low_value': '60', 'high_value': ' ', 'value_units': 'meV', 'value_source_passage': 'Recent years have seen a renewed interest in the II-VI semiconductor ZnO (direct bandgap ~3.3 eV). ZnO crystallises in the wurzite lattice, is optically transparent with a large exciton binding energy of 60 meV, and is capable of UV emission and lasing.', 'value_source_table': ' '}, {'material': 'ZnO', 'material_descriptor': 'thin films\npulsed laser deposition\nannealed in O2 atmosphere between 400°C and 600°C\ngrown on (0001) sapphire substrates\npolycrystalline', 'material_source_passage': 'ZnO films were grown on (0001) sapphire substrates by PLD using a 10 Hz pulsed KrF excimer laser (λ=248 nm). The fluence on target was set at 1.7 J/cm2 for all samples. A ZnO ceramic target (99.99%) was used throughout. The target to substrate distance was ~ 4 cm. The thin films were grown in an O2 (purity 99.99%) pressure of 0.3 mbar and the substrate temperature was maintained at 400°C during growth. Typically the films were 150-200 nm, giving a deposition rate of 0.025nm/pulse. The films were subsequently annealed in O2 (0.3 mbar) between 400°C and 600°C (see Table 1) in the growth chamber immediately after deposition.', 'material_source_table': ' ', 'property_name': 'lattice parameter\nc-axis lattice constant', 'property_descriptor': 'X-ray diffraction\nXRD', 'property_source_passage': 'The lattice parameters of the ZnO thin films perpendicular to the substrate can be calculated from the diffraction angles corresponding to the (1010) and (0002) planes. 
The intensity of the peak is small compared to the (0002) peak intensity due to the high degree of c-axis orientation. The average values of c- and a-axis lattice constants of our PLD samples are 0.518 nm and 0.330 nm respectively and the values for all samples were identical within the limits of our experimental accuracy (± 0.001 nm).', 'property_source_table': ' ', 'low_value': '0.518', 'high_value': '0.518', 'value_units': 'nm', 'value_source_passage': 'The lattice parameters of the ZnO thin films perpendicular to the substrate can be calculated from the diffraction angles corresponding to the (1010) and (0002) planes. The intensity of the peak is small compared to the (0002) peak intensity due to the high degree of c-axis orientation. The average values of c- and a-axis lattice constants of our PLD samples are 0.518 nm and 0.330 nm respectively and the values for all samples were identical within the limits of our experimental accuracy (± 0.001 nm).', 'value_source_table': ' '}, {'material': 'ZnO', 'material_descriptor': 'thin films\npulsed laser deposition\nannealed in O2 atmosphere between 400°C and 600°C\ngrown on (0001) sapphire substrates\npolycrystalline', 'material_source_passage': 'ZnO films were grown on (0001) sapphire substrates by PLD using a 10 Hz pulsed KrF excimer laser (λ=248 nm). The fluence on target was set at 1.7 J/cm2 for all samples. A ZnO ceramic target (99.99%) was used throughout. The target to substrate distance was ~ 4 cm. The thin films were grown in an O2 (purity 99.99%) pressure of 0.3 mbar and the substrate temperature was maintained at 400°C during growth. Typically the films were 150-200 nm, giving a deposition rate of 0.025nm/pulse. The films were subsequently annealed in O2 (0.3 mbar) between 400°C and 600°C (see Table 1) in the growth chamber immediately after deposition.', 'material_source_table': ' ', 'property_name': 'lattice parameter\na-axis lattice constant', 'property_descriptor': 'X-ray diffraction\nXRD', 'property_source_passage': 'The lattice parameters of the ZnO thin films perpendicular to the substrate can be calculated from the diffraction angles corresponding to the (1010) and (0002) planes. The intensity of the peak is small compared to the (0002) peak intensity due to the high degree of c-axis orientation. The average values of c- and a-axis lattice constants of our PLD samples are 0.518 nm and 0.330 nm respectively and the values for all samples were identical within the limits of our experimental accuracy (± 0.001 nm).', 'property_source_table': ' ', 'low_value': '0.33', 'high_value': '0.33', 'value_units': 'nm', 'value_source_passage': 'The lattice parameters of the ZnO thin films perpendicular to the substrate can be calculated from the diffraction angles corresponding to the (1010) and (0002) planes. The intensity of the peak is small compared to the (0002) peak intensity due to the high degree of c-axis orientation. The average values of c- and a-axis lattice constants of our PLD samples are 0.518 nm and 0.330 nm respectively and the values for all samples were identical within the limits of our experimental accuracy (± 0.001 nm).', 'value_source_table': ' '}]' 61 | ------------------------------ 62 | This below is the full text from a paper in latex format published in a materials science journal. 63 | Excerpt: 64 | Full text of the paper: 65 | {{text}} 66 | ------------------------------ 67 | With the above excerpt, the goal is to extract numeric values or numeric value ranges of material properties from the above paper. 
68 | 69 | Please only output your answer in the exact format as shown above without any prefix. 70 | Output: 71 | -------------------------------------------------------------------------------- /colabs/curie_inference.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "id": "QZi0OF2PukmQ", 8 | "colab": { 9 | "base_uri": "https://localhost:8080/" 10 | }, 11 | "outputId": "f0abe8c4-42f3-42d2-820d-9cc7a45b4238" 12 | }, 13 | "outputs": [ 14 | { 15 | "output_type": "stream", 16 | "name": "stdout", 17 | "text": [ 18 | "Requirement already satisfied: google-ai-generativelanguage in /usr/local/lib/python3.10/dist-packages (0.6.6)\n", 19 | "Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1 in /usr/local/lib/python3.10/dist-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage) (2.19.2)\n", 20 | "Requirement already satisfied: google-auth!=2.24.0,!=2.25.0,<3.0.0dev,>=2.14.1 in /usr/local/lib/python3.10/dist-packages (from google-ai-generativelanguage) (2.27.0)\n", 21 | "Requirement already satisfied: proto-plus<2.0.0dev,>=1.22.3 in /usr/local/lib/python3.10/dist-packages (from google-ai-generativelanguage) (1.24.0)\n", 22 | "Requirement already satisfied: protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5 in /usr/local/lib/python3.10/dist-packages (from google-ai-generativelanguage) (3.20.3)\n", 23 | "Requirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage) (1.65.0)\n", 24 | "Requirement already satisfied: requests<3.0.0.dev0,>=2.18.0 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage) (2.32.3)\n", 25 | "Requirement already satisfied: grpcio<2.0dev,>=1.33.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage) (1.64.1)\n", 26 | "Requirement already satisfied: grpcio-status<2.0.dev0,>=1.33.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage) (1.48.2)\n", 27 | "Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0dev,>=2.14.1->google-ai-generativelanguage) (5.5.0)\n", 28 | "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0dev,>=2.14.1->google-ai-generativelanguage) (0.4.1)\n", 29 | "Requirement 
already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0dev,>=2.14.1->google-ai-generativelanguage) (4.9)\n", 30 | "Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from pyasn1-modules>=0.2.1->google-auth!=2.24.0,!=2.25.0,<3.0.0dev,>=2.14.1->google-ai-generativelanguage) (0.6.1)\n", 31 | "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage) (3.3.2)\n", 32 | "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage) (3.10)\n", 33 | "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage) (2.2.3)\n", 34 | "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage) (2024.8.30)\n", 35 | "Requirement already satisfied: google-generativeai in /usr/local/lib/python3.10/dist-packages (0.7.2)\n", 36 | "Requirement already satisfied: google-ai-generativelanguage==0.6.6 in /usr/local/lib/python3.10/dist-packages (from google-generativeai) (0.6.6)\n", 37 | "Requirement already satisfied: google-api-core in /usr/local/lib/python3.10/dist-packages (from google-generativeai) (2.19.2)\n", 38 | "Requirement already satisfied: google-api-python-client in /usr/local/lib/python3.10/dist-packages (from google-generativeai) (2.137.0)\n", 39 | "Requirement already satisfied: google-auth>=2.15.0 in /usr/local/lib/python3.10/dist-packages (from google-generativeai) (2.27.0)\n", 40 | "Requirement already satisfied: protobuf in /usr/local/lib/python3.10/dist-packages (from google-generativeai) (3.20.3)\n", 41 | "Requirement already satisfied: pydantic in /usr/local/lib/python3.10/dist-packages (from google-generativeai) (2.9.2)\n", 42 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from google-generativeai) (4.66.5)\n", 43 | "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from google-generativeai) (4.12.2)\n", 44 | "Requirement already satisfied: proto-plus<2.0.0dev,>=1.22.3 in /usr/local/lib/python3.10/dist-packages (from google-ai-generativelanguage==0.6.6->google-generativeai) (1.24.0)\n", 45 | "Requirement already satisfied: 
googleapis-common-protos<2.0.dev0,>=1.56.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core->google-generativeai) (1.65.0)\n", 46 | "Requirement already satisfied: requests<3.0.0.dev0,>=2.18.0 in /usr/local/lib/python3.10/dist-packages (from google-api-core->google-generativeai) (2.32.3)\n", 47 | "Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from google-auth>=2.15.0->google-generativeai) (5.5.0)\n", 48 | "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth>=2.15.0->google-generativeai) (0.4.1)\n", 49 | "Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth>=2.15.0->google-generativeai) (4.9)\n", 50 | "Requirement already satisfied: httplib2<1.dev0,>=0.19.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client->google-generativeai) (0.22.0)\n", 51 | "Requirement already satisfied: google-auth-httplib2<1.0.0,>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client->google-generativeai) (0.2.0)\n", 52 | "Requirement already satisfied: uritemplate<5,>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client->google-generativeai) (4.1.1)\n", 53 | "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic->google-generativeai) (0.7.0)\n", 54 | "Requirement already satisfied: pydantic-core==2.23.4 in /usr/local/lib/python3.10/dist-packages (from pydantic->google-generativeai) (2.23.4)\n", 55 | "Requirement already satisfied: grpcio<2.0dev,>=1.33.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage==0.6.6->google-generativeai) (1.64.1)\n", 56 | "Requirement already satisfied: grpcio-status<2.0.dev0,>=1.33.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0dev,>=1.34.1->google-ai-generativelanguage==0.6.6->google-generativeai) (1.48.2)\n", 57 | "Requirement already satisfied: pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in /usr/local/lib/python3.10/dist-packages (from httplib2<1.dev0,>=0.19.0->google-api-python-client->google-generativeai) (3.1.4)\n", 58 | "Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from pyasn1-modules>=0.2.1->google-auth>=2.15.0->google-generativeai) (0.6.1)\n", 59 | "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core->google-generativeai) (3.3.2)\n", 60 | "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core->google-generativeai) (3.10)\n", 61 | "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core->google-generativeai) (2.2.3)\n", 62 | "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core->google-generativeai) (2024.8.30)\n" 63 | ] 64 | } 65 | ], 66 | "source": [ 67 | "!pip install google-ai-generativelanguage\n", 68 | "!pip install google-generativeai" 69 | ] 70 | }, 71 | { 72 | 
"cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "id": "kP6dd87I78Yq", 76 | "colab": { 77 | "base_uri": "https://localhost:8080/" 78 | }, 79 | "outputId": "fb190068-2a67-47c2-8319-1c533abf6f6b" 80 | }, 81 | "outputs": [ 82 | { 83 | "output_type": "stream", 84 | "name": "stdout", 85 | "text": [ 86 | "Mounted at /content/drive\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "import google.generativeai as genai\n", 92 | "import inspect\n", 93 | "import json\n", 94 | "import os\n", 95 | "import sys\n", 96 | "\n", 97 | "from google.colab import drive\n", 98 | "from google.colab import userdata\n", 99 | "from io import StringIO\n", 100 | "\n", 101 | "drive.mount('/content/drive')" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": { 107 | "id": "ElblcG7MfOM4" 108 | }, 109 | "source": [ 110 | "# Load Papers" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "id": "SNezBvLsPHh1", 118 | "colab": { 119 | "base_uri": "https://localhost:8080/" 120 | }, 121 | "outputId": "e8379076-2c02-40c1-a956-0f6afa0bb78b", 122 | "cellView": "form" 123 | }, 124 | "outputs": [ 125 | { 126 | "output_type": "stream", 127 | "name": "stdout", 128 | "text": [ 129 | "The result of your inference is going to be saved in /content/drive/MyDrive/benchmarks/public_release/inference/qecc_65/describe_code_in_paper/gemini/run_0/success/\n" 130 | ] 131 | } 132 | ], 133 | "source": [ 134 | "# @title Task Specific\n", 135 | "\n", 136 | "BASEDIR = '/content/drive/MyDrive/benchmarks/public_release' # @param {type:\"string\"}\n", 137 | "assert(os.path.exists(BASEDIR))\n", 138 | "\n", 139 | "TASK_NAME = \"qecc_65\" # @param {type:\"string\"} # specify valid names\n", 140 | "TASK_DIR = f\"{BASEDIR}/data/{TASK_NAME}/\"\n", 141 | "\n", 142 | "PROMPT_NAME = \"describe_code_in_paper\" # @param {type:\"string\"}\n", 143 | "PROMPT_FULL_NAME = PROMPT_NAME+\".txt\"\n", 144 | "\n", 145 | "DIRPATH = f'{BASEDIR}/inference'\n", 146 | "INPUT_DIR= f'{TASK_DIR}/inputs/'\n", 147 | "\n", 148 | "INFERENCE_TRIAL = \"run_0\" # @param {type:\"string\"}\n", 149 | "EXP_DIR = f\"{DIRPATH}/{TASK_NAME}/{PROMPT_NAME}/gemini/{INFERENCE_TRIAL}/success/\"\n", 150 | "\n", 151 | "MY_API_KEY=None # @param\n", 152 | "\n", 153 | "\n", 154 | "\n", 155 | "\n", 156 | "if not os.path.exists(EXP_DIR):\n", 157 | " os.makedirs(EXP_DIR)\n", 158 | " print(f\"Created directory {EXP_DIR}\")\n", 159 | "\n", 160 | "print(f\"The result of your inference is going to be saved in {EXP_DIR}\")" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": { 167 | "id": "7HGeHjjdyrIP" 168 | }, 169 | "outputs": [], 170 | "source": [ 171 | "def get_paper_list(inputdir):\n", 172 | " files = os.listdir(inputdir)\n", 173 | " papers = []\n", 174 | " for f in files:\n", 175 | " if f.endswith('.json'):\n", 176 | " papers.append(f[:f.rindex(\".json\")])\n", 177 | " return papers" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": { 184 | "id": "mMCDf8Ssym3G", 185 | "colab": { 186 | "base_uri": "https://localhost:8080/" 187 | }, 188 | "outputId": "8e6ac0fe-08c1-474b-9524-1b77285bdbce" 189 | }, 190 | "outputs": [ 191 | { 192 | "output_type": "stream", 193 | "name": "stdout", 194 | "text": [ 195 | "65 papers are loaded. 
\n", 196 | "Here is the list: \n", 197 | "['1502.05267', 'quant-ph_9705052', '1802.07419', '1712.07666', '1709.08658', '2003.02717', '2203.16534', '1503.08800', 'quant-ph_9711049', '2209.11405', '2311.08653', '2007.09154', 'quant-ph_9703002', '1709.04471', '2007.12152', '1910.10746', '1801.05897', '2107.02194', '2303.04798', '1903.03937', '2311.07679', '2201.07802', '1707.02308', 'quant-ph_9810055', '2106.02649', 'cond-mat_0607736', '1503.06237', '2210.16957', '1505.02576', '2008.09495', '2311.13040', '1703.02973', '1809.09801', '1907.09528', '2306.11621', '2009.03921', '2402.07476', '2312.04522', '2212.09935', '2309.16503', '2010.06628', '1906.11394', '1603.04442', 'quant-ph_0008040', 'quant-ph_0502086', 'quant-ph_9711021', 'cond-mat_0010440', '2303.02432', '2210.10808', 'quant-ph_0702075', 'quant-ph_0701020', '1602.00008', '1710.04631', 'quant-ph_0605138', 'quant-ph_0210097', '2110.11510', '1604.07925', 'quant-ph_9906114', '2112.01446', '1501.07779', '2203.00103', 'cs_0509062', 'cond-mat_9707273', '1805.01474', '1708.08474']\n" 198 | ] 199 | } 200 | ], 201 | "source": [ 202 | "papers = get_paper_list(INPUT_DIR)\n", 203 | "\n", 204 | "print(f\"{len(papers)} papers are loaded. \\nHere is the list: \\n{papers}\")" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": { 210 | "id": "chPK2r8CfRcC" 211 | }, 212 | "source": [ 213 | "# Run inference" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "id": "HssNqrtXcohP" 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "#@title Helper Functions\n", 225 | "\n", 226 | "def load_prompt(filename: str) -> str:\n", 227 | " with open(filename, 'r') as file:\n", 228 | " prompt = file.read()\n", 229 | " return prompt.strip()\n", 230 | "\n", 231 | "\n", 232 | "def prepare_task_for_paper(paper: str,\n", 233 | " prompt_path: str,\n", 234 | " )-> dict[str, str]:\n", 235 | " paper_input = f'{INPUT_DIR}/{paper}.json'\n", 236 | " inputs = json.load(open(paper_input, 'r'))\n", 237 | " raw_prompt = load_prompt(prompt_path)\n", 238 | " prompt = raw_prompt.replace('{{text}}', paper_input)\n", 239 | " return {\n", 240 | " 'record_id': inputs['record_id'],\n", 241 | " 'prompt_text': prompt,\n", 242 | " 'response_text': ''\n", 243 | " }\n", 244 | "\n", 245 | "\n", 246 | "def query_model(query_prompt: str,\n", 247 | " model_name: str = 'gemini-1.5-pro-latest'\n", 248 | " ) -> str:\n", 249 | " model = genai.GenerativeModel(model_name=model_name)\n", 250 | " response = model.generate_content(query_prompt)\n", 251 | " return response.text\n", 252 | "\n", 253 | "\n", 254 | "def run_eval_loop(paper_list,\n", 255 | " results_dir: str,\n", 256 | " ):\n", 257 | " genai.configure(api_key=MY_API_KEY)\n", 258 | " for PAPER in paper_list:\n", 259 | " print(PAPER)\n", 260 | " outpath = f'{results_dir}/{PAPER}.json'\n", 261 | " if os.path.exists(outpath):\n", 262 | " print(f'Skipping since result for {PAPER} already exists.')\n", 263 | " else:\n", 264 | " out_dict = prepare_task_for_paper(\n", 265 | " paper=PAPER,\n", 266 | " prompt_path=f'{BASEDIR}/prompts/{PROMPT_FULL_NAME}'\n", 267 | " )\n", 268 | "\n", 269 | " out_dict['response_text'] = query_model(out_dict['prompt_text'])\n", 270 | " json.dump(out_dict, open(outpath, 'w'))\n", 271 | " return out_dict\n" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "source": [ 277 | "run_eval_loop(papers, EXP_DIR)" 278 | ], 279 | "metadata": { 280 | "id": "KHA8EXitJnqK", 281 | "colab": { 282 | "base_uri": "https://localhost:8080/", 283 | "height": 
1000 284 | }, 285 | "outputId": "59fd2af7-8328-4bfc-855c-fd278957b36f" 286 | }, 287 | "execution_count": null, 288 | "outputs": [ 289 | { 290 | "output_type": "stream", 291 | "name": "stdout", 292 | "text": [ 293 | "1502.05267\n", 294 | "quant-ph_9705052\n", 295 | "1802.07419\n", 296 | "1712.07666\n", 297 | "1709.08658\n", 298 | "2003.02717\n", 299 | "2203.16534\n", 300 | "1503.08800\n", 301 | "quant-ph_9711049\n", 302 | "2209.11405\n", 303 | "2311.08653\n", 304 | "2007.09154\n", 305 | "quant-ph_9703002\n", 306 | "1709.04471\n", 307 | "2007.12152\n", 308 | "1910.10746\n", 309 | "1801.05897\n", 310 | "2107.02194\n", 311 | "2303.04798\n", 312 | "1903.03937\n", 313 | "2311.07679\n", 314 | "2201.07802\n", 315 | "1707.02308\n", 316 | "quant-ph_9810055\n", 317 | "2106.02649\n", 318 | "cond-mat_0607736\n", 319 | "1503.06237\n", 320 | "2210.16957\n", 321 | "1505.02576\n", 322 | "2008.09495\n", 323 | "2311.13040\n", 324 | "1703.02973\n", 325 | "1809.09801\n", 326 | "1907.09528\n", 327 | "2306.11621\n", 328 | "2009.03921\n", 329 | "2402.07476\n", 330 | "2312.04522\n", 331 | "2212.09935\n", 332 | "2309.16503\n", 333 | "2010.06628\n", 334 | "1906.11394\n", 335 | "1603.04442\n", 336 | "quant-ph_0008040\n", 337 | "quant-ph_0502086\n", 338 | "quant-ph_9711021\n", 339 | "cond-mat_0010440\n", 340 | "2303.02432\n", 341 | "2210.10808\n", 342 | "quant-ph_0702075\n", 343 | "quant-ph_0701020\n", 344 | "1602.00008\n", 345 | "1710.04631\n", 346 | "quant-ph_0605138\n", 347 | "quant-ph_0210097\n", 348 | "2110.11510\n", 349 | "1604.07925\n", 350 | "quant-ph_9906114\n", 351 | "2112.01446\n", 352 | "1501.07779\n", 353 | "2203.00103\n", 354 | "cs_0509062\n", 355 | "cond-mat_9707273\n", 356 | "1805.01474\n", 357 | "1708.08474\n" 358 | ] 359 | }, 360 | { 361 | "output_type": "execute_result", 362 | "data": { 363 | "text/plain": [ 364 | "{'record_id': '1708.08474',\n", 365 | " 'prompt_text': 'Fill in a YAML file for the code described in the attached paper according to the prescription defined in the YAML template that starts on the next paragraph. Fields are to be filled only if they are directly relevant to the code introduced in the paper. Be sure to extract any quantitative data like thresholds and code rates. Above all, be concise! If you cannot explain something technical in detail, do not try to explain it. If something is not detailed in the paper, do not mention it.\\n\\n#######################################################\\n## This is a code entry in the error correction zoo. ##\\n## https://github.com/errorcorrectionzoo ##\\n#######################################################\\n\\n# Use UTF-8 unicode encoding\\n# AMS-TeX commands are rendered inside \\\\( ... \\\\) using MathJaX.\\n# Allowed external bibliographic references are\\n# \\\\cite{arXiv:#.#} or \\\\cite{arXiv:quant-ph/#} (PREFERRED),\\n# \\\\cite{doi:#}, or, as a last resort\\n# \\\\cite{manual:{(enter citation line incl. 
author and year here)}}\\n# External websites such as code tables, coding theory packages, github pages linked as\\n# \\\\\\\\url{https://example.com/example}\\n# \\\\href{https://example.com/example}{link text}\\n# Internal references to codes are\\n# \\\\hyperref[code:code_id]{link text}\\n# Delete instructional comments when submitting\\n\\n# code id, physical, logical are all lower case\\n# physical or logical are one of the following: bits, q-ary_digits, matrices, rings, reals, spheres, qubits, qudits, galois, oscillators, spins, or categories\\ncode_id: no_spaces_lower_case\\nphysical: qubits\\nlogical: qubits\\n\\n# Only list if the code being described has specific parameters. These are typically in the form of (n,K,d) and [n,k,d] for classical codes, or ((n,K,d)) and [[n,k,d]] for quantum codes.\\nCode_parameter: \\'((2^r-1,r,d))_{6}\\'\\n\\n# Apostrophes are denoted by two apostrophe characters, i.e., \\'\\'\\n# Code title (SINGULAR) + first reference(s) (optional).\\nname: \\'Important-last-name favorite code\\'\\nintroduced: \\'\\\\cite{doi:10.1070/RM1997v052n06ABEH002155}\\'\\n\\n# Anything applicable to a larger parent set of codes (see below) should go in\\n# that entry instead of here.\\ndescription: |\\n First paragraph is a short standalone description, containing no references to figures.\\n\\n Subsequent paragraphs go into (possibly quantitative) details.\\n\\n \\\\subsection{Subsection headings work}\\n \\\\paragraph{And so do paragraphs}\\n# Only add subsections or paragraphs if the paper has long discussions about broad code classes.\\n\\n\\n# Long fields such as this one can be written in other YML formats, such as the one using the pipe symbol\\n# protection: |\\n# text...\\n# more text...\\nprotection: \\'Protects against ... Pauli noise. Approximate code with parameters ... for noise model ... .\\'\\n\\n# This field starts a list of specific labeled subfields; do not leave it empty. If empty, comment out. Also, indentations are important!\\nfeatures:\\n\\n # Do not include this if no specific encoders are mentioned.\\n encoders:\\n - \\'Specific description of a process that makes the states, usually for quantum codes.\\'\\n - \\'Unitary circuit of depth ... \\\\cite{arxiv:old-paper}.\\'\\n - \\'Measurement-based preparation ... with ancilla overhead of ... .\\'\\n - \\'Leave discussion of fault tolerance to fault-tolerance field.\\'\\n\\n # Not all fields are indexed by a dash\\n transversal_gates: \\'Transversal ... gates \\\\cite{doi:ok-paper}. Comment out if doesn\\'\\'t apply.\\'\\n\\n # Do not include this if no specific gates are mentioned.\\n general_gates:\\n - \\'Universal gate set achieved by either additional ... gate.\\'\\n - \\'Magic-state distillation protocols\\'\\n - \\'kth \\\\term{Clifford hierarchy} gates obtained by ... circuits\\'\\n\\n # Do not include this if no specific decoders are mentioned.\\n decoders:\\n - \\'Details about how syndrome measurements are done; discuss overhead, if applicable.\\'\\n - \\'MWPM decoding algorithm \\\\cite{doi:good-paper} with ... overhead.\\'\\n - \\'Just-in-time decoder with ... \\\\cite{arxiv:awesome-paper}.\\'\\n\\n fault_tolerance:\\n - \\'Transversal gates are fault-tolerant w.r.t. ... 
noise \\\\cite{doi:ok-paper}\\'\\n - \\'Other fault-tolerant gadgets (measurements, encoders, error correcting steps)\\'\\n - \\'Noise-model-preserving gadgets, noise-biased gates, fault-tolerant flag error correction\\'\\n - \\'Pieceable fault tolerance.\\'\\n\\n code_capacity_threshold:\\n - \\'\\\\(1.5%\\\\) error-correction threshold against some noise with *noiseless* decoder of some complexity \\\\cite{arxiv:paper}.\\'\\n\\n threshold:\\n - \\'\\\\(0.3\\\\%\\\\) error-correction threshold ... with *noisy* ... decoder of some complexity \\\\cite{doi:good-paper}.\\'\\n - \\'\\\\(10^{-5}\\\\) computational threshold using concatenated scheme under ... noise with overhead of ... \\'\\n\\n# Include only if specific experimental or real-world realizations are reported.\\nrealizations:\\n # List and explain the different \"domains\" of realizations in list items.\\n - \\'Code used in DVDs \\\\cite{doi:####...}, 5G, etc.\\'\\n - \\'Realized in trapped-ion quantum devices \\\\cite{arXiv:####.#####}, etc.\\'\\n\\n# Only include notes if the specific technical items listed below are included in the paper.\\nnotes:\\n - \\'Bounds on \\\\(n\\\\), \\\\(k\\\\), or \\\\(d\\\\) for this class, unless mentioned in description.\\'\\n - \\'Links to code tables, github, GAP algebra packages, more papers \\\\cite{arXiv:####.#####}.\\'\\n\\n# Include as many direct relations as were mentioned in the paper. The relations below are just examples.\\nrelations:\\n parents:\\n - code_id: code_id1\\n detail: \\'The smallest code family that includes this code that is defined over the same physical space structure or alphabet.\\'\\n cousins:\\n - code_id: code_id2\\n detail: \\'Codes that are directly relevant and described by a property shared by this code.\\'\\n - code_id: code_id3\\n detail: \\'Code family of similar encoding but with different physical space structures (qudit vs. qubit surface code).\\'\\n\\n# Include footer below and change the date to today’s date in the prescribed format\\n# Begin Entry Meta Information\\n_meta:\\n # Change log - most recent first\\n changelog:\\n - user_id: VictorVAlbert\\n date: \\'YYYY-MM-DD\\'\\n\\nHere is the paper\\n\\n/content/drive/MyDrive/benchmarks/public_release/data/qecc_65//inputs//1708.08474.json',\n", 366 | " 'response_text': \"#######################################################\\n## This is a code entry in the error correction zoo. ##\\n## https://github.com/errorcorrectionzoo ##\\n#######################################################\\n\\n# Use UTF-8 unicode encoding\\n# AMS-TeX commands are rendered inside \\\\( ... \\\\) using MathJaX.\\n# Allowed external bibliographic references are\\n# \\\\cite{arXiv:#.#} or \\\\cite{arXiv:quant-ph/#} (PREFERRED),\\n# \\\\cite{doi:#}, or, as a last resort\\n# \\\\cite{manual:{(enter citation line incl. author and year here)}}\\n# External websites such as code tables, coding theory packages, github pages linked as\\n# \\\\\\\\url{https://example.com/example}\\n# \\\\href{https://example.com/example}{link text}\\n# Internal references to codes are\\n# \\\\hyperref[code:code_id]{link text}\\n# Delete instructional comments when submitting\\n\\n# code id, physical, logical are all lower case\\n# physical or logical are one of the following: bits, q-ary_digits, matrices, rings, reals, spheres, qubits, qudits, galois, oscillators, spins, or categories\\ncode_id: trigono2\\nphysical: qubits\\nlogical: qubits\\n\\n# Only list if the code being described has specific parameters. 
These are typically in the form of (n,K,d) and [n,k,d] for classical codes, or ((n,K,d)) and [[n,k,d]] for quantum codes.\\nCode_parameter: '((2^r-1,r,d))_{6}' \\n\\n# Apostrophes are denoted by two apostrophe characters, i.e., ''\\n# Code title (SINGULAR) + first reference(s) (optional).\\nname: 'Trigonometric code'\\nintroduced: '\\\\cite{doi:10.1070/RM1997v052n06ABEH002155}'\\n\\n# Anything applicable to a larger parent set of codes (see below) should go in\\n# that entry instead of here.\\ndescription: |\\n This family of quantum codes is defined on a ring of \\\\(2^r-1\\\\) qubits, where \\\\(r\\\\) is any integer greater than one.\\n\\n The code has \\\\(r\\\\) logical qubits and is defined in terms of its stabilizer generators, which are products of Pauli \\\\(X\\\\) and \\\\(Z\\\\) operators acting on specific qubits in the ring. The code can be viewed as a generalization of the five-qubit code.\\n\\n\\n# Long fields such as this one can be written in other YML formats, such as the one using the pipe symbol\\n# protection: |\\n# text...\\n# more text...\\n# protection: 'Protects against ... Pauli noise. Approximate code with parameters ... for noise model ... .'\\n\\n# This field starts a list of specific labeled subfields; do not leave it empty. If empty, comment out. Also, indentations are important!\\nfeatures:\\n\\n # Do not include this if no specific encoders are mentioned.\\n # encoders:\\n # - 'Specific description of a process that makes the states, usually for quantum codes.'\\n # - 'Unitary circuit of depth ... \\\\cite{arxiv:old-paper}.'\\n # - 'Measurement-based preparation ... with ancilla overhead of ... .'\\n # - 'Leave discussion of fault tolerance to fault-tolerance field.'\\n\\n # Not all fields are indexed by a dash\\n # transversal_gates: 'Transversal ... gates \\\\cite{doi:ok-paper}. Comment out if doesn''t apply.'\\n\\n # Do not include this if no specific gates are mentioned.\\n # general_gates:\\n # - 'Universal gate set achieved by either additional ... gate.'\\n # - 'Magic-state distillation protocols'\\n # - 'kth \\\\term{Clifford hierarchy} gates obtained by ... circuits'\\n\\n # Do not include this if no specific decoders are mentioned.\\n # decoders:\\n # - 'Details about how syndrome measurements are done; discuss overhead, if applicable.'\\n # - 'MWPM decoding algorithm \\\\cite{doi:good-paper} with ... overhead.'\\n # - 'Just-in-time decoder with ... \\\\cite{arxiv:awesome-paper}.'\\n\\n # fault_tolerance:\\n # - 'Transversal gates are fault-tolerant w.r.t. ... noise \\\\cite{doi:ok-paper}'\\n # - 'Other fault-tolerant gadgets (measurements, encoders, error correcting steps)'\\n # - 'Noise-model-preserving gadgets, noise-biased gates, fault-tolerant flag error correction'\\n # - 'Pieceable fault tolerance.'\\n\\n # code_capacity_threshold:\\n # - '\\\\(1.5%\\\\) error-correction threshold against some noise with *noiseless* decoder of some complexity \\\\cite{arxiv:paper}.'\\n\\n # threshold:\\n # - '\\\\(0.3\\\\%\\\\) error-correction threshold ... with *noisy* ... decoder of some complexity \\\\cite{doi:good-paper}.'\\n # - '\\\\(10^{-5}\\\\) computational threshold using concatenated scheme under ... noise with overhead of ... 
'\\n\\n# Include only if specific experimental or real-world realizations are reported.\\n# realizations:\\n # - 'Code used in DVDs \\\\cite{doi:####...}, 5G, etc.'\\n # - 'Realized in trapped-ion quantum devices \\\\cite{arXiv:####.#####}, etc.'\\n\\n# Only include notes if the specific technical items listed below are included in the paper.\\n# notes:\\n # - 'Bounds on \\\\(n\\\\), \\\\(k\\\\), or \\\\(d\\\\) for this class, unless mentioned in description.'\\n # - 'Links to code tables, github, GAP algebra packages, more papers \\\\cite{arXiv:####.#####}.'\\n\\n# Include as many direct relations as were mentioned in the paper. The relations below are just examples.\\nrelations:\\n parents:\\n - code_id: additive\\n detail: 'This code is a special type of additive code.'\\n cousins:\\n - code_id: five_qubit\\n detail: 'This code is a generalization of the five-qubit code.'\\n\\n\\n# Include footer below and change the date to today’s date in the prescribed format\\n# Begin Entry Meta Information\\n_meta:\\n # Change log - most recent first\\n changelog:\\n - user_id: VictorVAlbert\\n date: '2023-10-26'\\n\"}" 367 | ] 368 | }, 369 | "metadata": {}, 370 | "execution_count": 8 371 | } 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "source": [], 377 | "metadata": { 378 | "id": "cs7vqT-4P6HJ" 379 | }, 380 | "execution_count": null, 381 | "outputs": [] 382 | } 383 | ], 384 | "metadata": { 385 | "accelerator": "GPU", 386 | "colab": { 387 | "gpuType": "T4", 388 | "provenance": [] 389 | }, 390 | "kernelspec": { 391 | "display_name": "Python 3", 392 | "name": "python3" 393 | }, 394 | "language_info": { 395 | "name": "python" 396 | } 397 | }, 398 | "nbformat": 4, 399 | "nbformat_minor": 0 400 | } -------------------------------------------------------------------------------- /colabs/curie_generate_tables_figures.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"code","source":["!pip install json5"],"metadata":{"id":"jZB-sM04kGdl"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"p1aMxI6bBKfF"},"outputs":[],"source":["import collections\n","\n","import json5\n","import matplotlib.pyplot as plt\n","import numpy as np\n","import pandas as pd\n","import seaborn as sns\n","\n","from google.colab import drive\n","# from google.colab import userdata\n","\n","\n","drive.mount('/content/drive')"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"3qSVW0_BNM02"},"outputs":[],"source":["BASEDIR = '/content/drive/Shareddrives/Curie/benchmarks/public_release'\n","meta_results_path = f\"{BASEDIR}/eval_results/up_to_date/meta_results.json\""]},{"cell_type":"code","source":["# Helper functions\n","\n","def extract_metric(x):\n"," \"\"\"Extracts the metric from the name string.\"\"\"\n"," parts = x.name.split('$$')\n"," if len(parts) > 4:\n"," return parts[4]\n"," else:\n"," return 'xxxxx'\n","\n","\n","def plot_example_counts(df):\n"," \"\"\"Create a contingency table (cross-tabulation) to count occurrences.\"\"\"\n"," cross_tab = pd.crosstab(df['task'], df['model'])\n"," plt.figure(figsize=[10, 8])\n"," sns.heatmap(cross_tab, annot=True, cmap='YlGnBu', fmt='.0f')\n"," plt.show()\n","\n","\n","def add_failed_examples(df):\n"," \"\"\"Adds failed examples to the DataFrame.\"\"\"\n"," plot_example_counts(df)\n"," df['task_prompt_metric_example'] = df.apply(\n"," lambda x: '+++'.join([x.task, x.prompt, x.metric, x.example]), axis=1\n"," )\n"," missing_combinations = []\n"," for 
model in df.model.unique():\n"," for task_prompt_metric_example in df.task_prompt_metric_example.unique():\n"," if not df[\n"," (df['model'] == model)\n"," & (df['task_prompt_metric_example'] == task_prompt_metric_example)\n"," ].size:\n"," task, prompt, metric, example = task_prompt_metric_example.split('+++')\n"," missing_combinations.append({\n"," 'model': model,\n"," 'task': task,\n"," 'prompt': prompt,\n"," 'metric': metric,\n"," 'example': example,\n"," 'value': 0,\n"," })\n","\n"," # Create a DataFrame for the missing combinations\n"," missing_df = pd.DataFrame(missing_combinations)\n","\n"," # Concatenate the original DataFrame with the DataFrame of missing combinations\n"," new_df = pd.concat([df, missing_df], ignore_index=True)\n"," plot_example_counts(new_df)\n"," return new_df"],"metadata":{"id":"eWNSnSiw4KH1"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"RMDYPSxc4RdX"},"source":["# Tables"]},{"cell_type":"markdown","metadata":{"id":"LLJpK1904RdX"},"source":["## Table 1"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"iF8Tyv2V4RdX"},"outputs":[],"source":["TASKS_2_METRICS = {\n"," (\"dft\", \"extract_structure_data_1_shot\"): [\n"," \"rougeLsum\",\n"," \"bert_f1\",\n"," ],\n"," (\"dft\", \"extract_dft_metadata_1_shot\"): [\n"," \"rougeLsum\",\n"," \"bert_f1\",\n"," ],\n"," (\"dft\", \"write_code_for_paper_0_shot\"): [\n"," \"rougeLsum\",\n"," \"bert_f1\",\n"," ],\n"," (\"mpv\", \"mat_paper_to_property_1_shot\"): [\n"," \"rougeLsum\",\n"," \"bert_f1\",\n"," ],\n"," (\"hfd\", \"derivation_prompt\"): [\n"," \"rougeLsum\",\n"," \"bert_f1\",\n"," ],\n"," (\"hfe\", \"extract_hamiltonian_0_shot\"): [\n"," \"rougeLsum\",\n"," \"bert_f1\",\n"," ],\n"," (\"qecc_65\", \"describe_code_in_paper\"): [\n"," \"rougeLsum\",\n"," \"bert_f1\",\n"," ],\n"," (\"geo\", \"extract_dataset_from_geo_papers_0_shot\"): [\n"," \"rougeLsum\",\n"," \"bert_f1\",\n"," ],\n"," (\"biogr\", \"georeference_image_0_shot\"): [\n"," \"iou\",\n"," ],\n"," (\"pdb\", \"reconstruct_protein_amino_acid_sequence_0_shot\"): [\n"," \"identity_ratio\",\n"," ],\n","}\n","\n","all_results = json5.load(open(meta_results_path, \"r\"))\n","df = pd.json_normalize(all_results, sep=\"$$\").transpose()\n","\n","# Add separe columns for task, model, example, metric, and value\n","df[\"task\"] = df.apply(lambda x: x.name.split(\"$$\")[0], axis=1)\n","df[\"prompt\"] = df.apply(lambda x: x.name.split(\"$$\")[1], axis=1)\n","df[\"model\"] = df.apply(lambda x: x.name.split(\"$$\")[2], axis=1)\n","df[\"example\"] = df.apply(lambda x: x.name.split(\"$$\")[3], axis=1)\n","df[\"metric\"] = df.apply(extract_metric, axis=1)\n","df[\"value\"] = df[0]\n","df[\"task\"] = df[\"task\"].str.replace(\"mpve\", \"mpv\")\n","df[\"task_prompt\"] = df.apply(lambda x: \"_\".join([x.task, x.prompt]), axis=1)\n","df[\"task_prompt_metric\"] = df.apply(\n"," lambda x: \"_\".join([x.task, x.prompt, x.metric]), axis=1\n",")\n","\n","df_len = len(df)\n","\n","# Delete extra tasks, prompts, and metrics\n","task_prompt_metric_keep = []\n","for (task, prompt), metrics in TASKS_2_METRICS.items():\n"," for metric in metrics:\n"," task_prompt_metric_keep.append(\"_\".join([task, prompt, metric]))\n","\n","df = df.drop(df[~df[\"task_prompt_metric\"].isin(task_prompt_metric_keep)].index)\n","print(f\"{df_len-len(df)} rows contain extra tasks and are dropped.\")\n","\n","# Zero non numerical entries\n","df[\"value\"] = df[\"value\"].apply(lambda x: 0 if isinstance(x, str) else x)\n","df = add_failed_examples(df)\n","\n","df_len 
= len(df)\n","print(f\"The final dataframe has {len(df)} rows.\")\n","\n","grouped = (\n"," df.groupby([\"model\", \"task\", \"metric\"])[\"value\"]\n"," .agg([\"mean\", \"std\"])\n"," .fillna(0)\n",")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Wxlr9-o54RdY"},"outputs":[],"source":["def print_row(model):\n"," metrics_strings = []\n"," for task in [\n"," \"dft\",\n"," \"mpv\",\n"," \"hfd\",\n"," \"hfe\",\n"," \"qecc_65\",\n"," \"geo\",\n"," ]:\n"," for metric in [\"rougeLsum\", \"bert_f1\"]:\n"," try:\n"," val = grouped.loc[(model, task, metric), \"mean\"]\n"," metrics_strings.append(str(round(val, 2)))\n"," except:\n"," print(f'Missing metric: {metric}, task: {task}, model: {model}')\n","\n"," try:\n"," val = grouped.loc[(model, 'biogr', \"iou\"), \"mean\"]\n"," # val = grouped.loc[(model, 'biogr', \"normalized_distance_error\"), \"mean\"]\n"," metrics_strings.append(str(round(val, 2)))\n"," except:\n"," metrics_strings.append(\"-\")\n"," try:\n"," val = grouped.loc[(model, 'pdb', \"identity_ratio\"), \"mean\"]\n"," metrics_strings.append(str(round(val, 2)))\n"," except:\n"," metrics_strings.append(\"-\")\n","\n"," return \"& \" + \" & \".join(metrics_strings) + \" \\\\\\\\\\n\"\n","\n","\n","# Begin LaTeX table\n","latex = \"\"\"\n","\\\\begin{table*}[!t]\n","\n","\\centering\n","\n","\\small\n","\\setlength{\\\\tabcolsep}{4pt}\n","\\\\resizebox{\\\\textwidth}{!}{\\\\begin{tabular}{l | c c | c c |c c | c c | c c | c c | c | c }\n","\n","\\\\toprule\n","\n","\\multirow{2}{*}{\\\\bf Method} & \\multicolumn{2}{c|}{\\\\bf \\data\\ DFT} & \\multicolumn{2}{c|}{\\\\bf \\data\\ MPV} &\n","\\multicolumn{2}{c|}{\\\\bf \\data\\ HFD} &\n","\\multicolumn{2}{c|}{\\\\bf \\data\\ HFE} &\n","\\multicolumn{2}{c|}{\\\\bf \\data\\ QECC} &\n","\\multicolumn{2}{c|}{\\\\bf \\data\\ GEO} &\n","{\\\\bf \\data\\ BIOGR} &\n","{\\\\bf \\data\\ PDB}\n","\\\\\\\\\n","\n","& R-L & B-F1 & R-L & B-F1 & R-L & B-F1 & R-L & B-F1 & R-L & B-F1 & R-L & B-F1 & IoU & ID_{r} \\\\\\\\\n","\n","\\midrule\n","\\multicolumn{15}{c}{\\\\textit{Zero-shot Open Weight LLMs}} \\\\\\\\\n","\\midrule\n","\n","Mixtral % \\cite{Mixtral}\n","\"\"\"\n","latex += print_row('mixtral-gcp')\n","\n","latex += \"Command-R$+$ %\\cite{CommandR+} \\n\"\n","latex += print_row('command-r-plus')\n","\n","latex += \"LongLLaMa %\\cite{LongLLaMa} \\n\"\n","latex += print_row('longllama')\n","\n","latex += \"\"\"\\n\n","\\\\midrule\n","\\multicolumn{15}{c}{\\\\textit{Zero-shot Closed Weight LLMs}} \\\\\\\\\n","\\midrule\n","\n","Gemini 1.0 Pro %\\cite{team2023gemini}\n","\"\"\"\n","latex += print_row('gemini-1.0-pro')\n","latex += \"GPT-4o %\\cite{gpt4orelease} \\n\"\n","latex += print_row('gpt-4o')\n","latex += \"Gemini 1.5 Pro %\\cite{reid2024gemini1.5} \\n\"\n","latex += print_row('gemini-1.5-pro-latest')\n","latex += \"Gemini 1.5 Flash %\\cite{reid2024gemini1.5} \\n\"\n","latex += print_row('gemini-1.5-flash-latest')\n","latex += \"Gemini 2.0 Flash %\\cite{xx} \\n\"\n","latex += print_row('gemini-2.0-flash-latest')\n","latex += \"Claude 3 (Opus) %\\cite{claude3} \\n\"\n","latex += print_row('claude-3-opus-20240229')\n","latex += \"\"\"\n","\\\\bottomrule\n","\\end{tabular}}\n","\\\\vspace{-2mm}\n","\\caption{\\\\textbf{Results comparing performance of all models on all tasks based on automated metrics} R-L: Rouge-L, and B-F1:BertScore-F1. The avg. performance of all 3 DFT tasks are reported under DFT. All models support a context length of 32k or more. BIOGR has multimodal inputs which is unsupported by the chosen open models. 
Blue highlights the highest values.}\n","\\label{tab:main_results}\n","\\\\vspace{-3mm}\n","\\end{table*}\"\"\"\n","\n","print(latex)"]},{"cell_type":"markdown","metadata":{"id":"0kMKsTVF4RdY"},"source":["## Table 2"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"p7G3kKQu4RdY"},"outputs":[],"source":["TASKS_2_METRICS = {\n"," (\"dft\", \"extract_structure_data_1_shot\"): [\n"," \"LMSim-F1\",\n"," \"LMSim-Pr\",\n"," \"LMSim-Re\",\n"," ],\n"," (\"dft\", \"extract_dft_metadata_1_shot\"): [\n"," \"LMSim-F1\",\n"," \"LMSim-Pr\",\n"," \"LMSim-Re\",\n"," ],\n"," (\"mpv\", \"mat_paper_to_property_1_shot\"): [\n"," \"LMSim-F1\",\n"," \"LMSim-Pr\",\n"," \"LMSim-Re\",\n"," ],\n"," (\"mpv\", \"mat_paper_to_property_1_shot_exclude_trivia\"): [\n"," \"LMSim-F1\",\n"," \"LMSim-Pr\",\n"," \"LMSim-Re\",\n"," ],\n"," (\"mpv\", \"mat_paper_to_property_1_shot_bandgap_refractive\"): [\n"," \"LMSim-F1\",\n"," \"LMSim-Pr\",\n"," \"LMSim-Re\",\n"," ],\n","}\n","\n","all_results = json5.load(open(meta_results_path, \"r\"))\n","df = pd.json_normalize(all_results, sep=\"$$\").transpose()\n","\n","\n","# Add separe columns for task, model, example, metric, and value\n","df[\"task\"] = df.apply(lambda x: x.name.split(\"$$\")[0], axis=1)\n","df[\"prompt\"] = df.apply(lambda x: x.name.split(\"$$\")[1], axis=1)\n","df[\"model\"] = df.apply(lambda x: x.name.split(\"$$\")[2], axis=1)\n","df[\"example\"] = df.apply(lambda x: x.name.split(\"$$\")[3], axis=1)\n","df[\"metric\"] = df.apply(extract_metric, axis=1)\n","df[\"value\"] = df[0]\n","\n","# Update the namings\n","df[\"task\"] = df[\"task\"].str.replace(\"mpve\", \"mpv\")\n","df[\"task_prompt\"] = df.apply(lambda x: \"_\".join([x.task, x.prompt]), axis=1)\n","df[\"task_prompt_metric\"] = df.apply(\n"," lambda x: \"_\".join([x.task, x.prompt, x.metric]), axis=1\n",")\n","\n","df_len = len(df)\n","# Delete extra tasks, prompts, and metrics\n","task_prompt_metric_keep = []\n","for (task, prompt), metrics in TASKS_2_METRICS.items():\n"," for metric in metrics:\n"," task_prompt_metric_keep.append(\"_\".join([task, prompt, metric]))\n","df = df.drop(df[~df[\"task_prompt_metric\"].isin(task_prompt_metric_keep)].index)\n","print(f\"{df_len-len(df)} rows contain extra tasks and are dropped.\")\n","\n","# Zero non numerical entries\n","df[\"value\"] = df[\"value\"].apply(lambda x: 0 if isinstance(x, str) else x)\n","df = add_failed_examples(df)\n","\n","# repeated\n","df[\"task_prompt\"] = df.apply(lambda x: \"_\".join([x.task, x.prompt]), axis=1)\n","df[\"task_prompt_metric\"] = df.apply(\n"," lambda x: \"_\".join([x.task, x.prompt, x.metric]), axis=1\n",")\n","df_len = len(df)\n","print(f\"The final dataframe has {len(df)} rows.\")\n","\n","grouped = (\n"," df.groupby([\"model\", \"task_prompt\", \"metric\"])[\"value\"]\n"," .agg([\"mean\", \"std\"])\n"," .fillna(0)\n",")\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"By3CwsQP4RdY"},"outputs":[],"source":["# Generate a LaTeX table for precision, recall, and F1 score.\n","\n","def print_row(model):\n"," metrics_strings = []\n"," for task_prompt in [\n"," \"dft_extract_structure_data_1_shot\",\n"," \"dft_extract_dft_metadata_1_shot\",\n"," \"mpv_mat_paper_to_property_1_shot\",\n"," \"mpv_mat_paper_to_property_1_shot_exclude_trivia\",\n"," \"mpv_mat_paper_to_property_1_shot_bandgap_refractive\",\n"," ]:\n"," for metric in [\"LMSim-Pr\", \"LMSim-Re\", \"LMSim-F1\"]:\n"," try:\n"," val = grouped.loc[(model, task_prompt, metric), \"mean\"]\n"," metrics_strings.append(str(round(100*val, 2)))\n"," 
except:\n","        print(f'Missing metric: {metric}, task_prompt: {task_prompt}, model: {model}')\n","\n","  return \"& \" + \" & \".join(metrics_strings) + \" \\\\\\\\\\n\"\n","\n","\n","# Begin LaTeX table\n","latex = \"\"\"\n","\\\\begin{table*}[!th]\n","\\centering\n","\\small\n","\\setlength{\\\\tabcolsep}{4pt}\n","\\\\resizebox{\\\\textwidth}{!}{\\\\begin{tabular}{l | c c c | c c c |c c c | c c c | c c c }\n","\\\\toprule\n","\n","\\multirow{2}{*}{\\\\bf Model} &\n","\\multicolumn{3}{c|}{\\\\bf \\data\\ DFT-S} &\n","\\multicolumn{3}{c|}{\\\\bf \\data\\ DFT-P} &\n","\\multicolumn{3}{c|}{\\\\bf \\data\\ MPV} &\n","\\multicolumn{3}{c|}{\\\\bf \\data\\ MPV-non-trivial} &\n","\\multicolumn{3}{c}{\\\\bf \\data\\ MPV-specific}\n","\\\\\\\\\n","\n","& Pr. & Rec. & F1 & Pr. & Rec. & F1 & Pr. & Rec. & F1 & Pr. & Rec. & F1 & Pr. & Rec. & F1 \\\\\\\\\n","\n","\\midrule\n","\\multicolumn{15}{c}{\\\\textit{Zero-shot Open Weight LLMs}} \\\\\\\\\n","\\midrule\n","Mixtral %\\cite{Mixtral}\n","\"\"\"\n","latex += print_row('mixtral-gcp')\n","\n","latex += \"Command-R$+$ %\\cite{CommandR+} \\n\"\n","latex += print_row('command-r-plus')\n","\n","latex += \"LongLLaMa %\\cite{LongLLaMa} \\n\"\n","latex += print_row('longllama')\n","\n","latex += \"\"\"\\n\n","\\\\midrule\n","\\multicolumn{15}{c}{\\\\textit{Zero-shot Closed Weight LLMs}} \\\\\\\\\n","\\midrule\n","\n","\n","%Gemini 1.0 Pro \\cite{team2023gemini1.0} \\\\\n","Gemini 1.0 Pro %\\cite{team2023gemini}\n","\"\"\"\n","latex += print_row('gemini-1.0-pro')\n","latex += \"GPT-4o %\\cite{gpt4orelease} \\n\"\n","latex += print_row('gpt-4o')\n","latex += \"Gemini 1.5 Pro %\\cite{reid2024gemini1.5} \\n\"\n","latex += print_row('gemini-1.5-pro-latest')\n","latex += \"Gemini 1.5 Flash %\\cite{reid2024gemini1.5} \\n\"\n","latex += print_row('gemini-1.5-flash-latest')\n","latex += \"Gemini 2.0 Flash \\n\"\n","latex += print_row('gemini-v3-s-thinking')\n","latex += \"Claude 3 (Opus) %\\cite{claude3} \\n\"\n","latex += print_row('claude-3-opus-20240229')\n","latex += \"\"\"\n","\\\\bottomrule\n","\\end{tabular}}\n","\\\\vspace{-1.5mm}\n","\\caption{\\\\textbf{Comparing performance using \\lmsim.} On sub-tasks requiring exhaustive retrieval of information, we use \\lmsim \\ based similarity to compute F1 scores for finer-grained assessment on materials science. We also include 2 ablations for the MPV task where we ask the LLM to retrieve non-trivial or specific property values (refractive index and optical bandgap) for materials. Command-R$+$ responses on MPV papers were incomplete, leading to invalid JSON dictionaries. 
We find the precision and recall values to match human evaluations on the MPV tasks for Gemini 1.5 pro and GPT-4o.}\n","\\label{tab:matsci_results}\n","\\\\vspace{-2mm}\n","\\end{table*}\"\"\"\n","\n","print(latex)"]},{"cell_type":"markdown","source":["## Table SI\n"],"metadata":{"id":"bHFClGqQeboZ"}},{"cell_type":"code","source":["all_results = json5.load(open(meta_results_path, \"r\"))\n","df = pd.json_normalize(all_results, sep=\"$$\").transpose()\n","\n","# Add separe columns for task, model, example, metric, and value\n","df[\"task\"] = df.apply(lambda x: x.name.split(\"$$\")[0], axis=1)\n","df[\"model\"] = df.apply(lambda x: x.name.split(\"$$\")[2], axis=1)\n","df[\"metric\"] = df.apply(extract_metric, axis=1)\n","df[\"example\"] = df.apply(lambda x: x.name.split(\"$$\")[3], axis=1)\n","df[\"value\"] = df[0].apply(lambda x: 0 if isinstance(x, str) else x)\n","\n","df = df[df['task']=='biogr']\n","\n","task = 'biogr'\n","missing_combinations = []\n","for model in df.model.unique():\n"," for example in df.example.unique():\n"," if not df[\n"," (df['model'] == model)\n"," & (df['example'] == example)\n"," ].size:\n"," for metric in df.metric.unique():\n"," missing_combinations.append({\n"," 'model': model,\n"," 'task': task,\n"," 'metric': metric,\n"," 'example': example,\n"," 'value': 0,\n"," })\n","\n","# Create a DataFrame for the missing combinations\n","missing_df = pd.DataFrame(missing_combinations)\n","df = pd.concat([df, missing_df], ignore_index=True)\n"],"metadata":{"id":"cgYbq95WQHhc"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["grouped = (\n"," df[\n"," df['metric'].isin(\n"," ['iou', 'normalized_distance_error', 'relative_box_size']\n"," )\n"," ]\n"," .groupby(['model', 'metric'])['value']\n"," .mean()\n",")"],"metadata":{"id":"wtrc_hJ0T9Zl"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["models = [\n"," 'gemini-1.5-flash-latest',\n"," 'gemini-1.0-pro',\n"," 'gemini-1.5-pro-latest',\n"," 'gpt-4o',\n"," 'claude-3-opus-20240229',\n"," 'gemini-2.0-flash-latest',\n","]\n","\n","# Begin LaTeX table\n","latex = \"\"\"\n","\\\\begin{table}[ht]\n","\\\\centering\n","\\\\begin{tabular}{l|c|c|c}\n","\\\\hline\n","\\\\textbf{Model} & \\\\textbf{IoU} & \\\\textbf{Normalized Distance Error} & \\\\textbf{Relative Box Size} \\\\\\\\\n","\\\\hline\n","\"\"\"\n","\n","def print_row(model_id, model_name):\n"," try:\n"," val1 = f\"{grouped.loc[(model_id, 'iou')]:.2f}\"\n"," except:\n"," val1 = \"-\"\n"," try:\n"," val2 = f\"{grouped.loc[(model_id, 'normalized_distance_error')]:.2f}\"\n"," except:\n"," val2 = \"-\"\n"," try:\n"," val3 = f\"{grouped.loc[(model_id, 'relative_box_size')]:.2f}\"\n"," except:\n"," val3 = \"-\"\n"," return f\"{model_name} & {val1} & {val2} & {val3} \\\\\\\\\\n\"\n"," # return f\"{model_name} & {val1:.2f} & {val2:.2f} & {val3:.2f} \\\\\\\\\\n\"\n"," # return f\"{model_name} & {grouped.loc[(model_id, 'iou')]:.2f} & {grouped.loc[(model_id, 'normalized_distance_error')]:.2f} & {grouped.loc[(model_id, 'relative_box_size')]:.2f} \\\\\\\\\\n\"\n","\n","latex += print_row('claude-3-opus-20240229', \"Claude 3 (Opus)\")\n","latex += print_row('gemini-1.5-pro-latest', \"Gemini 1.5 Pro\")\n","latex += print_row('gemini-2.0-flash-latest','Gemini 2.0 Flash')\n","latex += print_row('gemini-1.5-flash-latest','Gemini 1.5 Flash')\n","latex += print_row( 'gpt-4o', \"GPT-4o\")\n","latex += print_row('gemini-1.0-pro','Gemini 1.0 Pro')\n","\n","\n","latex += \"\"\"\n","\\\\hline\n","\\\\end{tabular}\n","\\\\caption{\\\\textbf{Detailed Biogr 
analysis.}}\n","\\\\label{tab:curie_stats_qalens}\n","\\\\end{table}\n","\"\"\""],"metadata":{"id":"8B59LTeEegLl"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["print(latex)"],"metadata":{"id":"fYjhl_MEPPGx"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"DqSIhPMCNG_R"},"source":["# Figures"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"jnyJxaZigDPC"},"outputs":[],"source":["# Assuming that the first metric in the list of metric in TASKS_2_METRICS is\n","# the main score value\n","\n","TASKS_2_METRICS = {\n"," (\"dft\", \"extract_structure_data_1_shot\"): [\"LMSim-F1\", \"LMSim-Pr\", \"LMSim-Re\", \"rougeLsum\", \"bert_f1\"],\n"," (\"dft\", \"extract_dft_metadata_1_shot\"): [\"LMSim-F1\", \"LMSim-Pr\", \"LMSim-Re\", \"rougeLsum\", \"bert_f1\"],\n"," (\"dft\", \"write_code_for_paper_0_shot\"): [\"rougeLsum\", \"bert_f1\"],\n"," (\"mpv\", \"mat_paper_to_property_1_shot\"): [\"LMSim-F1\", \"LMSim-Pr\", \"LMSim-Re\", \"rougeLsum\", \"bert_f1\"],\n"," (\"hfd\", \"derivation_prompt\"): [\"rougeLsum\", \"bert_f1\"],\n"," (\"hfe\", \"extract_hamiltonian_0_shot\"): [\"rougeLsum\", \"bert_f1\"],\n"," (\"qecc_65\", \"describe_code_in_paper\"): [\"rougeLsum\", \"bert_f1\"],\n"," (\"geo\", \"extract_dataset_from_geo_papers_0_shot\"): [\"rougeLsum\", \"bert_f1\"],\n"," (\"biogr\", \"georeference_image_0_shot\"): [\"iou\"],\n"," (\"pdb\", \"reconstruct_protein_amino_acid_sequence_0_shot\"): [\n"," \"identity_ratio\"\n"," ],\n","}\n","TASKS_2_TITLES = {\n"," \"dft-s\": \"DFT-S\",\n"," \"dft-p\": \"DFT-P\",\n"," \"dft-c\": \"DFT-C\",\n"," \"mpv\": \"MPV\",\n"," \"hfd\": \"HFD\",\n"," \"hfe\": \"HFE\",\n"," \"qecc_65\": \"QECC\",\n"," \"geo\": \"GEO\",\n"," \"biogr\": \"BIOGR\",\n"," \"pdb\": \"PDB\",\n","}\n","\n","GEO_BAD_LICENSE = [\n"," \"4d5c098bd142b2356e5485f7e3786255aa636073\",\n"," \"40cb6b737064b0881c536512d61817dbe79a3da4\",\n"," \"bb871818b3e903ba70b5e90929a575cd018e0b2b\",\n"," \"cea27d393e1b9dcfab5f8f9f2fbd33e5bdd96e76\",\n"," \"e10fd24fe75f5c10d54698a0d141dc9151cb2535\",\n"," \"32ecb10ab170fa193ea879e1f63ce8ae5d7b9f34\",\n"," \"bfde3d73c1df8980a0c3915e627236ce818d42c7\",\n"," \"00f8c2660ea4795d25e8e801fc831bb9dcf64022\",\n"," \"5a7a518d77aee623be48bbee6538fdbd77c26238\",\n"," \"375bbe6ac4c3f3c16ddddaea2464f9f2e112e00a\",\n"," \"8349632fbb06bbce22012097f1030d1c53a8e57b\",\n"," \"467f0fdc420f5cd8996c0b2b1eb33a3dcda93c5e\",\n"," \"5c0e2e83c0d8d4e5d86ab77bea49c62ac77ab9e9\",\n","]\n","\n","model_order = [\n"," 'Gemini 2.0 Flash',\n"," 'Claude 3 (Opus)',\n"," 'Gemini 1.5 Pro',\n"," 'Gemini 1.5 Flash',\n"," 'GPT-4o',\n"," 'Command R+',\n"," 'Gemini 1.0 Pro',\n"," 'Mixtral-8x7b',\n"," 'LongLLaMA',\n","]\n","palette = [\n"," '#D9886C',\n"," '#E7B7A0',\n"," '#B3D1DF',\n"," '#89A8C0',\n"," '#6DA78B',\n"," '#346A67',\n"," '#E0C085',\n"," '#C49B3C',\n"," '#A569BD',\n"," '#C27B5A',\n","\n","]\n","\n","palette2 = [\n"," '#58508d',\n"," '#bc5090',\n"," '#ff6361',\n"," '#ffa600',\n","]\n","\n","palette3 = [\n"," '#BA68C8',\n"," '#7986CB',\n"," '#B2DFDB',\n","]\n","\n","TASKS_2_METRICS_FIGS = {\n"," \"dft-s\": \"LMSim-F1\",\n"," \"dft-p\": \"LMSim-F1\",\n"," \"dft-c\": \"rougeLsum\",\n"," \"mpv\": \"LMSim-F1\",\n"," \"hfd\": \"rougeLsum\",\n"," \"hfe\": \"rougeLsum\",\n"," \"qecc_65\": \"rougeLsum\",\n"," \"geo\": \"rougeLsum\",\n"," \"biogr\": \"iou\",\n"," \"pdb\": \"identity_ratio\",\n","}"]},{"cell_type":"code","source":["# Load the meta_results.json and convert it to dataframe\n","\n","all_results = json5.load(open(meta_results_path, 
'r'))\n","df = pd.json_normalize(all_results, sep='$$').transpose()\n","\n","# Add separe columns for task, model, example, metric, and value\n","df['task'] = df.apply(lambda x: x.name.split('$$')[0], axis=1)\n","df['prompt'] = df.apply(lambda x: x.name.split('$$')[1], axis=1)\n","df['model'] = df.apply(lambda x: x.name.split('$$')[2], axis=1)\n","df['example'] = df.apply(lambda x: x.name.split('$$')[3], axis=1)\n","df['metric'] = df.apply(extract_metric, axis=1)\n","df['value'] = df[0]\n","\n","# Update the namings\n","df['task'] = df['task'].str.replace('mpve', 'mpv')\n","df['model'] = df['model'].str.replace('gpt-4o', 'GPT-4o')\n","df['model'] = df['model'].str.replace('claude-3-opus-20240229', 'Claude 3 (Opus)')\n","df['model'] = df['model'].str.replace('gemini-1.5-pro-latest', 'Gemini 1.5 Pro')\n","df['model'] = df['model'].str.replace('gemini-1.0-pro', 'Gemini 1.0 Pro')\n","df['model'] = df['model'].str.replace('gemini-1.5-flash-latest', 'Gemini 1.5 Flash')\n","df['model'] = df['model'].str.replace('gemini-2.0-flash-latest', 'Gemini 2.0 Flash')\n","df['model'] = df['model'].str.replace('longllama', 'LongLLaMA')\n","df['model'] = df['model'].str.replace('mixtral-gcp', 'Mixtral-8x7b')\n","df['model'] = df['model'].str.replace('command-r-plus', 'Command R+')\n","\n","df['task_prompt'] = df.apply(lambda x: '_'.join([x.task, x.prompt]), axis=1)\n","df['task_prompt_metric'] = df.apply(lambda x: '_'.join([x.task, x.prompt, x.metric]), axis=1)\n","\n","df_len = len(df)\n","print(f'The dataframe has \\033[1m{df_len}\\033[0m rows.')\n","\n","# Delete extra tasks, prompts, and metrics\n","task_prompt_metric_keep = []\n","for (task, prompt), metrics in TASKS_2_METRICS.items():\n"," for metric in metrics:\n"," task_prompt_metric_keep.append('_'.join([task, prompt, metric]))\n","\n","df = df.drop(df[~df['task_prompt_metric'].isin(task_prompt_metric_keep)].index)\n","print(\n"," f'\\033[1m{df_len-len(df)}\\033[0m rows contain extra tasks/prompts/metrics'\n"," ' and are dropped.'\n",")\n","\n","# Zero non numerical entries\n","df['value'] = df['value'].apply(lambda x: 0 if isinstance(x, str) else x)\n","df_len = len(df)\n","\n","# Delete geo examples with license issues\n","df = df.drop(df[df['example'].isin(GEO_BAD_LICENSE)].index)\n","print(\n"," f'\\033[1m{df_len-len(df)}\\033[0m rows contain license issues and are'\n"," ' dropped.'\n",")\n","df_len = len(df)\n","\n","# Divide the RougeLsum by 100\n","df.loc[df['metric'] == 'rougeLsum', 'value'] /= 100\n","\n","# Separe dft-p, dft-s, and dft-c\n","df['task'] = df.apply(\n"," lambda x: 'dft-s'\n"," if x.prompt == 'extract_structure_data_1_shot'\n"," else 'dft-p'\n"," if x.prompt == 'extract_dft_metadata_1_shot'\n"," else 'dft-c'\n"," if x.prompt == 'write_code_for_paper_0_shot'\n"," else x.task,\n"," axis=1,\n",")\n","\n","df_list = []\n","for task, metric in TASKS_2_METRICS_FIGS.items():\n"," df_list.append(df[(df['task'] == task) & (df['metric'] == metric)])\n","df = pd.concat(df_list)\n","\n","df = add_failed_examples(df)\n","print(f'The final dataframe has \\033[1m{len(df)}\\033[0m rows.')\n"],"metadata":{"id":"9oEaLekqQOKB"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"XMmGzF73nAQ5"},"outputs":[],"source":["# Sanity check: Check the values of unique metrics for each task\n","\n","df.groupby('task')['metric'].unique()"]},{"cell_type":"markdown","metadata":{"id":"2lKM8mtuaSMt"},"source":["## Figure 6: barplots for all 
tasks"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"D964bMCMFNl7"},"outputs":[],"source":["label_font_size = 16\n","tick_label_size = 14\n","legend_font_size = 14\n","\n","\n","def plot_score(df, ylim=None):\n"," sns.set_theme(style='whitegrid')\n","\n"," df_list = []\n"," for task, metric in TASKS_2_METRICS_FIGS.items():\n"," df_list.append(df[(df['task'] == task) & (df['metric'] == metric)])\n"," df_score = pd.concat(df_list)\n"," df_score['Score'] = df_score['value']\n","\n","\n"," plt.figure(figsize=(20, 6))\n","\n"," ax = sns.barplot(\n"," x='task',\n"," y='Score',\n"," estimator=pd.Series.mean,\n"," errorbar=('pi', 50),\n"," data=df_score,\n"," hue='model',\n"," palette=palette,\n"," capsize=0.05,\n"," errwidth=0.75,\n"," hue_order=model_order,\n"," order=TASKS_2_TITLES.keys()\n"," )\n","\n"," ax.set_xlabel('Task', fontsize=label_font_size)\n"," ax.set_ylabel('Score', fontsize=label_font_size)\n"," ax.tick_params(labelsize=tick_label_size)\n"," ax.set_xticklabels(TASKS_2_TITLES.values(), rotation=0)\n","\n"," # Add hatch patterns to specific bars\n"," bars = ax.patches\n"," num_hatches = len(df['task'].unique())\n"," hatches = (\n"," ['o'] * num_hatches\n"," + [''] * num_hatches\n"," + [''] * num_hatches\n"," + [''] * num_hatches\n"," + ['//'] * num_hatches\n"," + [''] * num_hatches\n"," + ['\\\\'] * num_hatches\n"," )\n","\n"," for bar, hatch in zip(bars, hatches):\n"," bar.set_hatch(hatch)\n","\n"," plt.legend(loc='upper left', ncol=8)\n"," if ylim is not None:\n"," plt.ylim([0, ylim])\n","\n"," plt.savefig('Figure_6.png', format='png', dpi=300)\n"," plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"7ypnP0_za5bf"},"outputs":[],"source":["plot_score(df, 0.9)"]},{"cell_type":"markdown","metadata":{"id":"fhQwh6vrkwgY"},"source":["## Figure 2: average over tasks"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"6HxEeyeRcice"},"outputs":[],"source":["df_list = []\n","for task, metric in TASKS_2_METRICS_FIGS.items():\n"," df_list.append(df[(df['task'] == task) & (df['metric'] == metric)])\n","df_score = pd.concat(df_list)\n","df_score['Score'] = df_score['value']\n","\n","df_mean = df_score.groupby(['model', 'task'])['Score'].agg([\"mean\"]).unstack()\n","df_mean['Average Score'] = df_mean.mean(axis=1)\n","df_mean['model'] = df_mean.index\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"ANx0KgXIebyH"},"outputs":[],"source":["# If a cell is NaN, we don't have any prediction for that domain with the model.\n","df_mean"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"m3NCnQcqmmUw"},"outputs":[],"source":["ax = sns.barplot(\n"," x='Average Score',\n"," y='model',\n"," data=df_mean,\n"," palette=palette,\n"," order=model_order,\n",");\n","\n","ax.set_ylabel('', fontsize=label_font_size)\n","ax.set_xlabel('Average Score', fontsize=label_font_size)\n","ax.tick_params(labelsize=tick_label_size)\n","plt.xlim([0, 0.35])\n","plt.xticks(fontsize=12)\n","plt.tight_layout()\n","plt.savefig('Figure_2.png', format='png', dpi=300)\n","plt.show()\n"]},{"cell_type":"markdown","metadata":{"id":"ZTAWck-Hq_ja"},"source":["## Figure 1: Compare Gemini 1.0 vs Gemini 1.5 on other benchmarks"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"LTsQvB1MnZbi"},"outputs":[],"source":["# https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf\n","\n","metrics = {\n"," 1:{\n"," 'Model': 'Gemini 1.0 Pro',\n"," 'CURIE': float(df_mean[df_mean['model'] == 'Gemini 1.0 Pro'][\n"," 'Average Score'\n"," 
].values[0]) * 100,\n"," 'DROP': 74.9, # Variable shots,\n"," 'GPQA': 27.9, # 4-shot,\n"," 'MMLU': 71.8, # 5-shot,\n"," },\n"," 2:{\n"," 'Model': 'Gemini 1.5 Pro',\n"," 'CURIE': float(df_mean[df_mean['model'] == 'Gemini 1.5 Pro'][\n"," 'Average Score'\n"," ].values[0]) * 100,\n"," 'DROP': 74.1, # Variable shots,\n"," 'GPQA': 46.2, # 0-shot,\n"," 'MMLU': 85.9, # 5-shot,\n"," },\n","}\n","\n","other_benchmarks = pd.DataFrame(metrics).transpose()"]},{"cell_type":"code","source":["# https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf\n","\n","metrics = {\n"," 1:{\n"," 'Model': 'Claude 3 (Opus)',\n"," 'CURIE': float(df_mean[df_mean['model'] == 'Claude 3 (Opus)'][\n"," 'Average Score'\n"," ].values[0]) * 100,\n"," 'ZeroScrolls': 39.07, # Variable shots,\n"," 'GPQA': 50.4, # 4-shot,\n"," 'MMLU-pro': 76.12, # 5-shot,\n"," 'MathVista': 50.5,\n"," 'RULER': 89,\n"," },\n"," 2:{\n"," 'Model': 'Gemini 1.5 Pro',\n"," 'CURIE': float(df_mean[df_mean['model'] == 'Gemini 1.5 Pro'][\n"," 'Average Score'\n"," ].values[0]) * 100,\n"," 'ZeroScrolls': np.nan,\n"," 'GPQA': 46.2, # 4-shot,\n"," 'MMLU-pro': 69.03, # 5-shot,\n"," 'MathVista': 63.9,\n"," 'RULER': 95.5,\n"," },\n"," 3:{\n"," 'Model': 'GPT-4o',\n"," 'CURIE': float(df_mean[df_mean['model'] == 'GPT-4o'][\n"," 'Average Score'\n"," ].values[0]) * 100,\n"," 'ZeroScrolls': 41.67, # Variable shots,\n"," 'GPQA': 50.4, # 4-shot,\n"," 'MMLU-pro': 76.12, # 5-shot,\n"," 'MathVista': 50.5,\n"," 'RULER': np.nan\n"," },\n","}\n","\n","other_benchmarks = pd.DataFrame(metrics).transpose()"],"metadata":{"id":"YMowdBZ94J_5"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["df_mean['model']"],"metadata":{"id":"cmsPMG3A_zmB"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"bKCvdjJKwjhI"},"outputs":[],"source":["other_benchmarks"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"LadlBQxpyK0s"},"outputs":[],"source":["other_benchmarks = other_benchmarks.melt(\n"," id_vars=['Model'], var_name='Benchmark', value_name='Score'\n",")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"hEC6rAc5nuCh"},"outputs":[],"source":["ax = sns.barplot(\n"," y='Score',\n"," hue='Benchmark',\n"," x='Model',\n"," data=other_benchmarks,\n"," palette=palette4,\n"," hue_order=['CURIE', 'ZeroScrolls', 'GPQA', 'MathVista', 'MMLU-pro', 'RULER' ],\n",")\n","\n","# Get the legend handles and labels\n","handles, labels = plt.gca().get_legend_handles_labels()\n","\n","# Modify the legend labels\n","labels_info = {'RULER': 'Long Context', 'MMLU-pro': 'Understanding', 'MathVista': 'Reasoning in Visual Context', 'GPQA': 'Science Expertise', 'ZeroScrolls': \"Long Context\", 'CURIE':'Science + Long Context (ours)'}\n","labels = [f'{l}: {labels_info[l]}' for l in labels]\n","\n","# Set the legend labels\n","plt.legend(handles, labels, loc='upper center', ncol=2, bbox_to_anchor=(0.5, 1.35), borderaxespad=0.)\n","plt.xlabel('')\n","\n","plt.tight_layout()\n","\n","plt.savefig('Figure_1.png', format='png', dpi=300)\n","plt.show()"]},{"cell_type":"markdown","metadata":{"id":"kqbyxXnzLMMg"},"source":["## Figure 7: separate results by difficulty levels"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"EfEePOlyLMAg"},"outputs":[],"source":["import json5\n","import pandas as pd"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"yEOH5Ps-MFLS"},"outputs":[],"source":["# Load the difficulty level json file\n","\n","difficulty_json = json5.load(\n"," open(\n"," 
f\"{BASEDIR}/data/difficulty_levels.json\"\n"," )\n",")\n","\n","df_lists = []\n","for task, difficulties in difficulty_json.items():\n"," df_diff = pd.DataFrame(difficulties, index=['difficulty']).transpose()\n"," df_diff['task'] = len(df_diff) * [task]\n"," df_diff['example'] = df_diff.index\n"," df_diff = df_diff.reset_index()\n"," df_diff = df_diff.drop(columns=['index'])\n"," df_lists.append(df_diff)\n","\n","df_diff = pd.concat(df_lists)\n","df_diff = df_diff.replace('qecc_85', 'qecc_65')\n","df_diff = df_diff.replace('mpve', 'mpv')\n","df_diff['task_example'] = df_diff.apply(\n"," lambda x: x['task'] + '_' + x['example'], axis=1\n",")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Ov3TIaBHQzKF"},"outputs":[],"source":["# Add difficulty levels to the df_score\n","\n","df_score['task'] = df_score['task'].apply(lambda x: 'dft' if 'dft' in x else x)\n","\n","df_score['task_example'] = df_score.apply(\n"," lambda x: x['task'] + '_' + x['example'], axis=1\n",")\n","df_score = df_score.merge(df_diff, on=['task_example'], how='left')\n","\n","# Quick check that they merged properly\n","collections.Counter(list(df_score['difficulty']))"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"tsXebTV3LUNc"},"outputs":[],"source":["# Average scores over all tasks, separated by example difficulties\n","df_score['difficulty'] = df_score['difficulty'].apply(lambda x: x.lower())\n","df_mean_hard = df_score[df_score['difficulty']=='hard'].groupby(['model', 'task_x'])['Score'].agg([\"mean\"]).unstack()\n","df_mean_hard['Average Score'] = df_mean_hard.mean(axis=1)\n","df_mean_hard['model'] = df_mean_hard.index\n","\n","\n","df_mean_medium = df_score[df_score['difficulty']=='medium'].groupby(['model', 'task_x'])['Score'].agg([\"mean\"]).unstack()\n","df_mean_medium['Average Score'] = df_mean_medium.mean(axis=1)\n","df_mean_medium['model'] = df_mean_medium.index\n","\n","df_mean_easy = df_score[df_score['difficulty']=='easy'].groupby(['model', 'task_x'])['Score'].agg([\"mean\"]).unstack()\n","df_mean_easy['Average Score'] = df_mean_easy.mean(axis=1)\n","df_mean_easy['model'] = df_mean_easy.index"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"czeXOsZ4-P6D"},"outputs":[],"source":["df_mean = (\n"," df_score.groupby(['model', 'difficulty'])['Score']\n"," .agg(['mean'])\n"," .unstack()\n",")\n","\n","df_mean = df_mean.reset_index()\n","df_mean.columns = ['_'.join(x) for x in df_mean.columns]\n","df_mean.rename(columns={'model_': 'model', 'mean_easy': 'Easy', 'mean_medium': 'Medium', 'mean_hard': 'Hard'}, inplace=True)\n","\n","df_mean = df_mean.melt(\n"," id_vars=['model'],\n"," value_vars=['Easy', 'Medium', 'Hard'],\n"," var_name='difficulty',\n"," value_name='mean score',\n",")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"6k2HBIOXBqf0"},"outputs":[],"source":["plt.figure(figsize=(12, 5))\n","ax = sns.barplot(\n"," y='mean score',\n"," x='model',\n"," data=df_mean,\n"," palette=palette3,\n"," hue='difficulty',\n"," order=model_order,\n",")\n","plt.tight_layout()\n","plt.savefig('Figure_8.png', format='png', dpi=300)\n","plt.show()"]},{"cell_type":"markdown","source":["## Figure 35: identity ratio versus protein sequence length"],"metadata":{"id":"HqIAOnVGXYYj"}},{"cell_type":"code","source":["import os\n","import glob\n","df_pdb = df[df['task'] == 'pdb']"],"metadata":{"id":"f_BXQF4uXs11"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["gt_pattern = f\"{BASEDIR}/data/pdb/ground_truth/*.json\"\n","\n","gt_pdb = {}\n","for path 
in glob.glob(gt_pattern):\n"," example = os.path.basename(path).split('.json')[0]\n"," gt_pdb[example] = {'json': json5.load(open(path, 'r'))}\n","\n","for k, v in gt_pdb.items():\n"," v['sequence_length'] = len(v['json']['sequence'])\n","\n","df_pdb['sequence_length'] = df_pdb.apply(lambda x: gt_pdb[x['example']]['sequence_length'], axis=1)\n","df_pdb['sequence_length_bins'] = pd.qcut(df_pdb['sequence_length'], q=20)\n","\n","plt.figure(figsize=(21, 7))\n","sns.set_style('white')\n","sns.set(font_scale=1.5)\n","ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, loc: \"{:.0f}\".format(x)))\n","\n","sns.boxplot(\n"," df_pdb[df_pdb['metric'] == 'identity_ratio'],\n"," y='value',\n"," x='sequence_length_bins',\n",")\n","plt.ylim([0, 1])\n","plt.xticks(rotation=90)\n","plt.savefig('pdb_supp.png', format='png', bbox_inches='tight', dpi=300)\n","\n","plt.show()"],"metadata":{"id":"qlFge3AjGSAo"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"BBIkw0JsT8OV"},"execution_count":null,"outputs":[]}],"metadata":{"colab":{"last_runtime":{"build_target":"","kind":"shared"},"private_outputs":true,"provenance":[{"file_id":"/piper/depot/google3/research/biology/collaborations/sci_asst/colabs/benchamark_paper_writing_helper.ipynb","timestamp":1725396000912},{"file_id":"1kVpGcgKbPgXsQTxmigipjN-ljPM5PT13","timestamp":1719886313973},{"file_id":"1rlruCg6PAYZnSEa96gGr2nonrSbokPur","timestamp":1719801054734},{"file_id":"1iKAX1S8xqdOYQ-xb5jsZ71Rawj-kFKWq","timestamp":1718492553434},{"file_id":"/piper/depot/google3/research/biology/collaborations/sci_asst/colabs/benchamark_paper_writing_helper.ipynb","timestamp":1717653306344},{"file_id":"/piper/depot/google3/research/biology/collaborations/sci_asst/colabs/benchamark_paper_writing_helper.ipynb?workspaceId=shamsiz:ag3::citc","timestamp":1717611597311},{"file_id":"/piper/depot/google3/research/biology/collaborations/sci_asst/colabs/benchamark_paper_writing_helper.ipynb","timestamp":1717544306678},{"file_id":"/piper/depot/google3/research/biology/collaborations/sci_asst/colabs/benchamark_paper_writing_helpers.ipynb?workspaceId=vsubhashini:textable::citc","timestamp":1717542909019},{"file_id":"1nSpN7dCWzccH7HOXdUnR8rusYJVIzbFE","timestamp":1717542719440},{"file_id":"/piper/depot/google3/research/biology/collaborations/sci_asst/colabs/benchamark_paper_writing_helper.ipynb","timestamp":1717471223674},{"file_id":"/piper/depot/google3/research/biology/collaborations/sci_asst/colabs/benchamark_paper_writing_helper.ipynb","timestamp":1717467236222},{"file_id":"12iBI1WxeWyladNhZrI7_jpDb6y6wzMyd","timestamp":1717110410072}],"toc_visible":true},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0} --------------------------------------------------------------------------------