OilMetagenomesDB

Community curated database of oil metagenomes

github-watchers github-stars github-license


Community octocat curated database of the metagenome of oil and oil fields🛢️🦠

Description

This list contains information of oil connected metagenomes which include crude oil samples, samples from the different parts of oil production system in oil fields, also the samples from contaminated soil and water. This database is crucial for the use of metagenomic oil samples in machine learning. It follows the FAIR data organization principles. Here you can find the name of publications, years of publications, DOI, type of samples, geographic location of sample collection, and associated physical or chemical conditions.

The database will release bug fix updates on issues from the GitHub that have been reported every half a year. You can always create new issues and make a tag to become our contributor.

Our next update will include a web-based application for the user-friendly addition of new specimen information to the database. My Map

Usage

🎋Click for branch creation tutorial ![git_gif_guide_1](/OilMetagenomesDB/assets/gif/git_gif_guide_1.gif)
📋Click to learn data addition and pull request review ![git_gif_guide_2](/OilMetagenomesDB/assets/gif/git_gif_guide_2.gif)
⚙️Click to view data validation ![git_gif_guide_3](/OilMetagenomesDB/assets/gif/git_gif_guide_3.gif)

If you find dataset validation errors or think of a new dataset validation, please create an issue in our GitHub repository.

Samples Column Specifications

The SAMPLE tables stores information about the sample before it was sequenced: type, date of collection, geographic coordinates, depth, temperature of sample extraction, material of sample, etc.

Numeric and text fields must be filled in with “None” (capitalized) to indicate ‘there can be no value meaningfully’ or “unknownn” to indicate ‘value is not known, but theoretically could be’. For example, if only water is being tested in a sample and there is no oil in the sample, the API index would be “None” because there is no oil in that sample and there can be no viscosity index. Conversely, if the viscosity index in the “crude oil” sample was not measured, then the API column should be set to “unknown”.

All column with ‘defined categories’ should be validated against schemas_samples/<column>.json. This is necessary to ensure data consistency.

If you wish to a new category, please consult with the agni-bioinformatics-lab, and then add it to schemas_samples/<column>.json.

Sample columns are as follows (documentation):

Libraries Column Specifications

The LIBRARIES tables store information about each specific reed from the library - id_ in databases, sequencing type (paired-end, single-end), sequencing strategy (WGS, RNA-Seq, amplicon), links to downloads and publications, etc.

Numeric and text fields must be filled in with “None” (capitalized) to indicate ‘there can be no value meaningfully’ or “unknownn” to indicate ‘value is not known, but theoretically could be’. For example, if only water is being tested in a sample and there is no oil in the sample, the API index would be “None” because there is no oil in that sample and there can be no viscosity index. Conversely, if the viscosity index in the “crude oil” sample was not measured, then the API column should be set to “unknown”.

All column with ‘defined categories’ should be validated against schemas_libraries/<column>.json. This is to ensure data consistency. E.g., all libraries sequenced on Illumina NextSeq 500s are listed as NextSeq 500 (as defined in schemas_libraries/instrument_model.json). This is to ensure data consistency.

Library columns are as follows (documentation):

Contributing

Samples added to the OilMetagenomeDB should come from published studies. Samples should also be available in publicly accessible databases (e.g., EBI ENA or NCBI SRA).

When filling in the data, each sample from the publication will get a new row. For guidance on what information to add to each column see the README.md section “Usage” for a handy guide.

Some tips

Anknowlegments

We aimed to create a comprehensive database to serve as the foundation for our machine learning models, as no existing database provides such extensive insights into oil metagenomic patterns. And our team would like to express our gratitude to AncientMetagenomeDir for inspiring us to create this public database. We would also like to thank ITMO University, AGNI and Tatneft for supporting the project.