Community-curated database of metagenomes from oil and oil fields
This list contains information on oil-related metagenomes, including crude oil samples, samples from different parts of the oil production system in oil fields, and samples from contaminated soil and water. The database is designed to support the use of metagenomic oil samples in machine learning. It follows the FAIR data principles. Here you can find publication names, publication years, DOIs, sample types, geographic locations of sample collection, and associated physical or chemical conditions.
The database releases bug-fix updates every six months, addressing issues that have been reported on GitHub. You can open a new issue at any time and add a tag to become one of our contributors.
Our next update will include a web-based application for the user-friendly addition of new specimen information to the database.
If you find dataset validation errors or think of a new dataset validation, please create an issue in our GitHub repository.
The SAMPLE tables store information about each sample before it was sequenced: type, date of collection, geographic coordinates, depth, temperature of sample extraction, material of sample, etc.
Numeric and text fields must never be left empty. Fill them in with "None" (capitalized) to indicate "no value can meaningfully exist" or with "unknown" to indicate "the value is not known, but in theory could be". For example, if a sample contains only water and no oil, the API index would be "None", because there is no oil in that sample and therefore no viscosity index can exist. Conversely, if the viscosity index of a "crude oil" sample was not measured, the API column should be set to "unknown".
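The rule above can be sketched in a few lines of Python. This is only an illustration, not part of the project's tooling: the function names and the list of rejected near-miss spellings (for example "NA" or "none") are assumptions made for the example.

```python
from typing import Optional

NOT_APPLICABLE = "None"   # no value can meaningfully exist (e.g. API index of a water sample)
NOT_RECORDED = "unknown"  # a value could exist but was not reported


def fill_missing(value: Optional[str], applicable: bool) -> str:
    """Return the cell text for a field, using the two sentinels for missing values."""
    if value is not None and value.strip():
        return value.strip()
    return NOT_RECORDED if applicable else NOT_APPLICABLE


def check_cell(cell: str) -> bool:
    """Reject empty cells and near-miss spellings of the sentinels (hypothetical list)."""
    near_misses = {"", "none", "NONE", "Unknown", "UNKNOWN", "NA", "N/A"}
    return cell.strip() not in near_misses
```

For instance, `fill_missing(None, applicable=False)` gives "None" for the API index of a water-only sample, while `fill_missing(None, applicable=True)` gives "unknown" for a crude oil sample whose index was not measured.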
All columns with "defined categories" should be validated against schemas_samples/<column>.json. This is necessary to ensure data consistency.
If you wish to add a new category, please consult with the agni-bioinformatics-lab first, and then add it to schemas_samples/<column>.json.
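As a rough sketch of what such a check could look like, the snippet below compares column values against a category list loaded from a schema file. The internal layout of the schema files is not described here, so the code assumes a JSON document with the allowed categories under a top-level "enum" key; the function names are hypothetical as well.

```python
import json
from pathlib import Path


def load_categories(schema_path: Path) -> set[str]:
    """Load allowed categories from a schema file.

    Assumption: the file is JSON with the categories listed under "enum".
    """
    schema = json.loads(schema_path.read_text(encoding="utf-8"))
    return set(schema["enum"])


def invalid_values(values: list[str], allowed: set[str]) -> list[str]:
    """Return the distinct values that are not among the allowed categories."""
    return sorted({value for value in values if value not in allowed})
```

An empty list from `invalid_values` means every value in the column matches a defined category; anything else points to a typo or to a category that still needs to be proposed.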
Sample columns are as follows (documentation):
The LIBRARIES tables store information about each specific read from the library: id_ in databases, sequencing type (paired-end, single-end), sequencing strategy (WGS, RNA-Seq, amplicon), links to downloads and publications, etc.
Numeric and text fields must never be left empty. Fill them in with "None" (capitalized) to indicate "no value can meaningfully exist" or with "unknown" to indicate "the value is not known, but in theory could be". For example, if a sample contains only water and no oil, the API index would be "None", because there is no oil in that sample and therefore no viscosity index can exist. Conversely, if the viscosity index of a "crude oil" sample was not measured, the API column should be set to "unknown".
All columns with "defined categories" should be validated against schemas_libraries/<column>.json. This is to ensure data consistency. For example, all libraries sequenced on Illumina NextSeq 500s are listed as NextSeq 500 (as defined in schemas_libraries/instrument_model.json).
Library columns are as follows (documentation):
Samples added to the OilMetagenomeDB should come from published studies. Samples should also be available in publicly accessible databases (e.g., EBI ENA or NCBI SRA).
When filling in the data, each sample from the publication gets its own row. For guidance on what information to add to each column, see the "Usage" section of README.md.
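The one-row-per-sample convention can be illustrated with a short Python sketch. Everything specific in it is an assumption for the example: the column names are hypothetical, and tab-separated output is used only for illustration, since the file format is not stated here.

```python
import csv
import io


def rows_for_publication(publication: dict, samples: list[dict]) -> list[dict]:
    """Build one row per sample, repeating the publication fields on every row."""
    return [{**publication, **sample} for sample in samples]


def to_tsv(rows: list[dict], fieldnames: list[str]) -> str:
    """Serialize rows as tab-separated text, refusing rows with missing columns."""
    for row in rows:
        missing = [name for name in fieldnames if name not in row]
        if missing:
            # Nothing is filled in automatically: the curator decides
            # between "None" and "unknown" for each missing field.
            raise ValueError(f"row is missing columns: {missing}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A publication with three samples therefore yields three rows that share the same publication fields and differ only in their sample fields.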
We aimed to create a comprehensive database to serve as the foundation for our machine learning models, as no existing database provides such extensive insights into oil metagenomic patterns. Our team would like to express its gratitude to AncientMetagenomeDir for inspiring us to create this public database. We would also like to thank ITMO University, AGNI and Tatneft for supporting the project.