E. Coli Whole Cell Model by Covert Lab (CovertLab/WholeCellEcoliRelease)

github.com/CovertLab/WholeCellEcoliRelease is a whole cell simulation model created by Covert Lab and other collaborators.

The project is written in Python, hurray!

But according to te README, it seems to be the use a code drop model with on-request access to master. Ciro Santilli asked at rationale on GitHub discussion, and they confirmed as expected that it is to:

to prevent their publication ideas from being stolen. Who would steal publication ideas with public proof in an issue tracker without crediting original authors? Academia is broken. Academia should be the most open form of knowledge sharing. But instead we get this silly competition for publication points.
to prevent noise from non-collaborators. But they only get like 2 issues as year on such a meganiche subject... Did you know that you can ignore people, and even block them if they are particularly annoying? Much more likely is that no one will every hear about your project and that it will die with its last graduate student slave.

The project is a followup to the earlier M. genitalium whole cell model by Covert lab which modelled Mycoplasma genitalium. E. Coli has 8x more genes (500 vs 4k), but it the undisputed bacterial model organism and as such has been studied much more thoroughly. It also reproduces faster than Mycoplasma (20 minutes vs a few hours), which is a huge advantages for validation/exploratory experiments.

The project has a partial dependency on the proprietary optimization software CPLEX which is freeware, for students, not sure what it is used for exactly, from the comment in the requirements.txt the dependency is only partial.

This project makes Ciro Santilli think of the E. Coli as an optimization problem. Given such external nutrient/temperature condition, which DNA sequence makes the cell grow the fastest? Balancing metabolites feels like designing a Factorio speedrun.

There is one major thing missing thing in the current model: promoters/transcription factor interactions are not modelled due to lack/low quality of experimental data: github.com/CovertLab/WholeCellEcoliRelease/issues/21. They just have a magic direct "transcription factor to gene" relationship, encoded at reconstruction/ecoli/flat/foldChanges.tsv in terms of type "if this is present, such protein is expressed 10x more". Transcription units are not implemented at all it appears.

Everything in this section refers to version 7e4cc9e57de76752df0f4e32eca95fb653ea64e4, the code drop from November 2020, and was tested on Ubuntu 21.04 with a docker install of docker.pkg.github.com/covertlab/wholecellecolirelease/wcm-full with image id 502c3e604265, unless otherwise noted.

1. Install and first run

Words: 327

At 7e4cc9e57de76752df0f4e32eca95fb653ea64e4 you basically need to use the Docker image on Ubuntu 21.04 due to pip breaking changes... (not their fault). Perhaps pyenv would solve things, but who has the patience for that?!?!

The Docker setup from README does just work. The image download is a bit tedius, as it requires you to create a GitHub API key as described in the README, but there must be reasons for that.

Once the image is downloaded, you really want to run is from the root of the source tree:

sudo docker run --name=wcm -it -v "$(pwd):/wcEcoli" docker.pkg.github.com/covertlab/wholecellecolirelease/wcm-full

This mounts the host source under /wcEcoli, so you can easily edit and view output images from your host. Once inside Docker we can compile, run the simulation, and analyze results with:

make clean compile &&
python runscripts/manual/runFitter.py &&
python runscripts/manual/runSim.py &&
python runscripts/manual/analysisVariant.py &&
python runscripts/manual/analysisCohort.py &&
python runscripts/manual/analysisMultigen.py &&
python runscripts/manual/analysisSingle.py

The meaning of each of the analysis commands is described at Section 2. "Output overview".

As a Docker refresher, after you stop the container, e.g. by restarting your computer or running sudo docker stop wcm, you can get back into it with:

sudo docker start wcm
sudo docker run -it wcm bash

runscripts/manual/runFitter.py takes about 15 minutes, and it generates files such as reconstruction/ecoli/dataclasses/process/two_component_system.py (related) which is required to run the simulation, it is basically a part of the build.

runSim.py does the main simulation, progress output contains lines of type:

Time (s)  Dry mass     Dry mass      Protein          RNA    Small mol     Expected
              (fg)  fold change  fold change  fold change  fold change  fold change
========  ========  ===========  ===========  ===========  ===========  ===========
    0.00    403.09        1.000        1.000        1.000        1.000        1.000
    0.20    403.18        1.000        1.000        1.000        1.000        1.000

and then it ended on the Lenovo ThinkPad P51 (2017) at:

 2569.18    783.09        1.943        1.910        2.005        1.950        1.963

Simulation finished:
 - Length: 0:42:49
 - Runtime: 0:09:13

when the cell had almost doubled, and presumably divided in 42 minutes of simulated time, which could make sense compared to the 20 under optimal conditions.

2. Output overview

Words: 681 Articles: 1

Run output is placed under out/:

Some of the output data is stored as .cpickle files. To observe those files, you need the original Python classes, and therefore you have to be inside Docker, from the host it won't work.

We can list all the plots that have been produced under out/ with

find -name '*.png'

Plots are also available in SVG and PDF formats, e.g.:

PNG: ./out/manual/plotOut/low_res_plots/massFractionSummary.png
SVG: ./out/manual/plotOut/svg_plots/massFractionSummary.svg The SVGs write text as polygons, see also: SVG fonts.
PDF: ./out/manual/plotOut/massFractionSummary.pdf

The output directory has a hierarchical structure of type:

./out/manual/wildtype_000000/000000/generation_000000/000000/

where:

wildtype_000000: variant conditions. wildtype is a human readable label, and 000000 is an index amongst the possible wildtype conditions. For example, we can have different simulations with different nutrients, or different DNA sequences. An example of this is shown at run variants.
000000: initial random seed for the initial cell, likely fed to NumPy's np.random.seed
genereation_000000: this will increase with generations if we simulate multiple cells, which is supported by the model
000000: this will presumably contain the cell index within a generation

We also understand that some of the top level directories contain summaries over all cells, e.g. the massFractionSummary.pdf plot exists at several levels of the hierarchy:

./out/manual/plotOut/massFractionSummary.pdf
./out/manual/wildtype_000000/plotOut/massFractionSummary.pdf
./out/manual/wildtype_000000/000000/plotOut/massFractionSummary.pdf
./out/manual/wildtype_000000/000000/generation_000000/000000/plotOut/massFractionSummary.pdf

Each of thoes four levels of plotOut is generated by a different one of the analysis scripts:

./out/manual/plotOut: generated by python runscripts/manual/analysisVariant.py. Contains comparisons of different variant conditions. We confirm this by looking at the results of run variants.
./out/manual/wildtype_000000/plotOut: generated by python runscripts/manual/analysisCohort.py --variant_index 0. TODO not sure how to differentiate between two different labels e.g. wildtype_000000 and somethingElse_000000. If -v is not given, a it just picks the first one alphabetically. TODO not sure how to automatically generate all of those plots without inspecting the directories.
./out/manual/wildtype_000000/000000/plotOut: generated by python runscripts/manual/analysisMultigen.py --variant_index 0 --seed 0
./out/manual/wildtype_000000/000000/generation_000000/000000/plotOut: generated by python runscripts/manual/analysisSingle.py --variant_index 0 --seed 0 --generation 0 --daughter 0. Contains information about a single specific cell.

2.1. Mass fraction summary plot analysis

Words: 351

Let's look into a sample plot, out/manual/plotOut/svg_plots/massFractionSummary.svg, and try to understand as much as we can about what it means and how it was generated.

This plot contains how much of each type of mass is present in all cells. Since we simulated just one cell, it will be the same as the results for that cell.

We can see that all of them grow more or less linearly, perhaps as the start of an exponential. We can see that all of them grow more or less linearly, perhaps as the start of an exponential. We can see that all of them grow more or less linearly, perhaps as the start of an exponential.

total dry mass (mass excluding water)
protein mass
rRNA mass
mRNA mass
DNA mass. The last label is not very visible on the plots, but we can deduce it from the source code.

By grepping the title "Cell mass fractions" in the source code, we see the files:

models/ecoli/analysis/cohort/massFractionSummary.py
models/ecoli/analysis/multigen/massFractionSummary.py
models/ecoli/analysis/variant/massFractionSummary.py

which must correspond to the different massFractionSummary plots throughout different levels of the hierarchy.

By reading models/ecoli/analysis/variant/massFractionSummary.py a little bit, we see that:

the plotting is done with Matplotlib, hurray
it is reading its data from files under ./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Mass/, more precisely ./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Mass/columns/<column-name>/data. They are binary files however.
Looking at the source for wholecell/io/tablereader.py shows that those are just a standard NumPy serialization mechanism. Maybe they should have used the Hierarchical Data Format instead.
We can also take this opportunity to try and find where the data is coming from. Mass from the ./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Mass/ looks like an ID, so we grep that and we reach models/ecoli/listeners/mass.py.
From this we understand that all data that is to be saved from a simulation must be coming from listeners: likely nothing, or not much, is dumped by default, because otherwise it would take up too much disk space. You have to explicitly say what it is that you want to save via a listener that acts on each time step.

Figure 1.
Minimal condition mass fraction plot
. Source. File name: `out/manual/plotOut/svg_plots/massFractionSummary.svg`

More plot types will be explored at time series run variant, where we will contrast two runs with different growth mediums.

3. Run variants

Words: 928 Articles: 2

It would be boring if we could only simulate the same condition all the time, so let's have a look at the different boundary conditions that we can apply to the cell!

We are able to alter things like the composition of the external medium, and the genome of the bacteria, which will make the bacteria behave differently.

The variant selection is a bit cumbersome as we have to use indexes instead of names, but one you know what you are doing, it is fine.

Of course, genetic modification is limited only to experimentally known protein interactions due to the intractability of computational protein folding and computational chemistry in general, solving those would bsai.

3.1. Default run variant

Words: 80

The default run variant, if you don't pass any options, just has the minimal growth conditions set. What this means can be seen at condition.

Notably, this implies a growth medium that includes glucose and salt. It also includes oxygen, which is not strictly required, but greatly benefits cell growth, and is of course easier to have than not have as it is part of the atmosphere!

But the medium does not include amino acids, which the bacteria will have to produce by itself.

3.2. Time series run variant

Words: 741

To modify the nutrients as a function of time, with To select a time series we can use something like:

python runscripts/manual/runSim.py --variant nutrientTimeSeries 25 25

As mentioned in python runscripts/manual/runSim.py --help, nutrientTimeSeries is one of the choices from github.com/CovertLab/WholeCellEcoliRelease/blob/7e4cc9e57de76752df0f4e32eca95fb653ea64e4/models/ecoli/sim/variants/__init__.py#L57

25 25 means to start from index 25 and also end at 25, so running just one simulation. 25 27 would run 25 then 26 and then 27 for example.

The timeseries with index 25 is reconstruction/ecoli/flat/condition/timeseries/000025_cut_aa.tsv and contains

"time (units.s)" "nutrients"
0 "minimal_plus_amino_acids"
1200 "minimal"

so we understand that it starts with extra amino acids in the medium, which benefit the cell, and half way through those are removed at time 1200s = 20 minutes. We would therefore expect the cell to start expressing amino acid production genes exactly at that point.

nutrients likely means condition in that file however, see bug report with 1 1 failing: github.com/CovertLab/WholeCellEcoliRelease/issues/24

When we do this the simulation ends in:

Simulation finished:
 - Length: 0:34:23
 - Runtime: 0:08:03

so we see that the doubling time was faster than the one with minimal conditions of 0:42:49, which makes sense, since during the first 20 minutes the cell had extra amino acid nutrients at its disposal.

The output directory now contains simulation output data under out/manual/nutrientTimeSeries_000025/. Let's run analysis and plots for that:

python runscripts/manual/analysisVariant.py &&
python runscripts/manual/analysisCohort.py --variant 25 &&
python runscripts/manual/analysisMultigen.py --variant 25 &&
python runscripts/manual/analysisSingle.py --variant 25

We can now compare the outputs of this run to the default wildtype_000000 run from Section 1. "Install and first run".

out/manual/plotOut/svg_plots/massFractionSummary.svg: because we now have two variants in the same out/ folder, wildtype_000000 and nutrientTimeSeries_000025, we now see a side by side comparision of both on the same graph!
The run variant where we started with amino acids initially grows faster as expected, because the cell didn't have to make it's own amino acids, so growth is a bit more efficient.
Then, at 20 minutes, which is about 0.3 hours, we see that the cell starts growing a bit less fast as the slope of the curve decreases a bit, because we removed that free amino acid supply.
Figure 2.
Minimal condition vs amino acid cut mass fraction plot
. Source. From file out/manual/plotOut/svg_plots/massFractionSummary.svg.

The following plots from under out/manual/wildtype_000000/000000/{generation_000000,nutrientTimeSeries_000025}/000000/plotOut/svg_plots have been manually joined side-by-side with:

for f in out/manual/wildtype_000000/000000/generation_000000/000000/plotOut/svg_plots/*; do
  echo $f
  svg_stack.py \
    --direction h \
    out/manual/wildtype_000000/000000/generation_000000/000000/plotOut/svg_plots/$(basename $f) \
    out/manual/nutrientTimeSeries_000025/000000/generation_000000/000000/plotOut/svg_plots/$(basename $f) \
    > tmp/$(basename $f)
done

Figure 3.
Amino acid counts
. Source. `aaCounts.svg`:
default: quantities just increase
amino acid cut: there is an abrupt fall at 20 minutes when we cut off external supply, presumably because it takes some time for the cell to start producing its own

Figure 4.
External exchange fluxes of amino acids
. Source. `aaExchangeFluxes.svg`:
default: no exchanges
amino acid cut: for all graphs except phenylalanine (PHE), either the cell was intaking the AA (negative flux), and that intake goes to 0 when the supply is cut, or the flux is always 0.
For PHE however, the flux is at all times, except shortly after the cut. Why? And why there was no excretion on the default conditions?

Figure 5.
Evaluation time
. Source. `evaluationTime.svg`: this has nothing to do with biology, but it is rather a profile of the program runtime. We can see that the simulation gets slower and slower as time passes, presumably because there are more and more molecules to simulate.

Figure 6.
mRNA count of highly expressed mRNAs
. Source. From file `expression_rna_03_high.svg`. Each of the entries is a gene using the conventional gene naming convention of `xyzW`, e.g. here's the BioCyc for the first entry, `tufA`: biocyc.org/gene?orgid=ECOLI&id=EG11036, which comments
Elongation factor Tu (EF-Tu) is the most abundant protein in E. coli.
and
In E. coli, EF-Tu is encoded by two genes, tufA and tufB
. What they seem to mean is that tufA and tufB are two similar molecules, either of which can make up the EF-Tu of the E. Coli, which is an important part of translation.

Figure 7.
External exchange fluxes
. Source.
`mediaExcange.svg`: this one is similar to `aaExchangeFluxes.svg`, but it also tracks other substances. The color version makes it easier to squeeze more substances in a given space, but you lose the shape of curves a bit. The title seems reversed: red must be excretion, since that's where glucose (GLC) is.
The substances are different between the default and amino acid cut graphs, they seem to be the most exchanged substances. On the amino cut graph, first we see the cell intaking most (except phenylalanine, which is excreted for some reason). When we cut amino acids, the uptake of course stops.

4. Other run variants

Words: 75

Besides time series run variants, conditions can also be selected directly without a time series as in:

python runscripts/manual/runSim.py --variant condition 1 1

which select row indices from reconstruction/ecoli/flat/condition/condition_defs.tsv. The above 1 1 would mean the second line of that file which starts with:

"condition" "nutrients" "genotype perturbations" "doubling time (units.min)" "active TFs"
"basal" "minimal" {} 44.0 []
"no_oxygen" "minimal_minus_oxygen" {} 100.0 []
"with_aa" "minimal_plus_amino_acids" {} 25.0 ["CPLX-125", "MONOMER0-162", "CPLX0-7671", "CPLX0-228", "MONOMER0-155"]

so 1 means no_oxygen.

5. Source code overview

Words: 1k Articles: 1

The key model database is located in the source code at reconstruction/ecoli/flat.

Let's try to understand some interesting looking, with a special focus on our understanding of the tiny E. Coli K-12 MG1655 operon thrLABC part of the metabolism, which we have well understood at Section "E. Coli K-12 MG1655 operon thrLABC".

We'll realize that a lot of data and IDs come from/match BioCyc quite closely.

reconstruction/ecoli/flat/compartments.tsv contains cellular compartment information:
```
"abbrev" "id"
"n" "CCO-BAC-NUCLEOID"
"j" "CCO-CELL-PROJECTION"
"w" "CCO-CW-BAC-NEG"
"c" "CCO-CYTOSOL"
"e" "CCO-EXTRACELLULAR"
"m" "CCO-MEMBRANE"
"o" "CCO-OUTER-MEM"
"p" "CCO-PERI-BAC"
"l" "CCO-PILUS"
"i" "CCO-PM-BAC-NEG"
```
- CCO: "Celular COmpartment"
- BAC-NUCLEOID: nucleoid
- CELL-PROJECTION: cell projection
- CW-BAC-NEG: TODO confirm: cell wall (of a Gram-negative bacteria)
- CYTOSOL: cytosol
- EXTRACELLULAR: outside the cell
- MEMBRANE: cell membrane
- OUTER-MEM: bacterial outer membrane
- PERI-BAC: periplasm
- PILUS: pilus
- PM-BAC-NEG: TODO: plasma membrane, but that is the same as cell membrane no?
reconstruction/ecoli/flat/promoters.tsv contains promoter information. Simple file, sample lines:
```
"position" "direction" "id" "name"
148 "+" "PM00249" "thrLp"
```
corresponds to E. Coli K-12 MG1655 promoter thrLp, which starts as position 148.
reconstruction/ecoli/flat/proteins.tsv contains protein information. Sample line corresponding to e. Coli K-12 MG1655 gene thrA:
```
"aaCount" "name" "seq" "comments" "codingRnaSeq" "mw" "location" "rnaId" "id" "geneId"
[91, 46, 38, 44, 12, 53, 30, 63, 14, 46, 89, 34, 23, 30, 29, 51, 34, 4, 20, 0, 69] "ThrA" "MRVL..." "Location information from Ecocyc dump." "AUGCGAGUGUUG..." [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89103.51099999998, 0.0, 0.0, 0.0, 0.0] ["c"] "EG10998_RNA" "ASPKINIHOMOSERDEHYDROGI-MONOMER" "EG10998"
```
so we understand that:
- aaCount: amino acid count, how many of each of the 20 proteinogenic amino acid are there
- seq: full sequence, using the single letter abbreviation of the proteinogenic amino acids
- mw; molecular weight? The 11 components appear to be given at reconstruction/ecoli/flat/scripts/unifyBulkFiles.py:
  molecular_weight_keys = [ '23srRNA', '16srRNA', '5srRNA', 'tRNA', 'mRNA', 'miscRNA', 'protein', 'metabolite', 'water', 'DNA', 'RNA' # nonspecific RNA ]
  so they simply classify the weight? Presumably this exists for complexes that have multiple classes?
  - 23srRNA, 16srRNA, 5srRNA are the three structural RNAs present in the ribosome: 23S ribosomal RNA, 16S ribosomal RNA, 5S ribosomal RNA, all others are obvious:
  - tRNA
  - mRNA
  - protein. This is the seventh class, and this enzyme only contains mass in this class as expected.
  - metabolite
  - water
  - DNA
  - RNA: TODO rna vs miscRNA
- location: cell compartment where the protein is present, c defined at reconstruction/ecoli/flat/compartments.tsv as cytoplasm, as expected for something that will make an amino acid
reconstruction/ecoli/flat/rnas.tsv: TODO vs transcriptionUnits.tsv. Sample lines:
```
"halfLife" "name" "seq" "type" "modifiedForms" "monomerId" "comments" "mw" "location" "ntCount" "id" "geneId" "microarray expression"
174.0 "ThrA [RNA]" "AUGCGAGUGUUG..." "mRNA" [] "ASPKINIHOMOSERDEHYDROGI-MONOMER" "" [0.0, 0.0, 0.0, 0.0, 790935.00399999996, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ["c"] [553, 615, 692, 603] "EG10998_RNA" "EG10998" 0.0005264904
```
- halfLife: half-life
- mw: molecular weight, same as in reconstruction/ecoli/flat/proteins.tsv. This molecule only have weight in the mRNA class, as expected, as it just codes for a protein
- location: same as in reconstruction/ecoli/flat/proteins.tsv
- ntCount: nucleotide count for each of the ATGC
- microarray expression: presumably refers to DNA microarray for gene expression profiling, but what measure exactly?

reconstruction/ecoli/flat/sequence.fasta: FASTA DNA sequence, first two lines:

>E. coli K-12 MG1655 U00096.2 (1 to 4639675 = 4639675 bp)
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG

reconstruction/ecoli/flat/transcriptionUnits.tsv: transcription units. We can observe for example the two different transcription units of the E. Coli K-12 MG1655 operon thrLABC in the lines:
```
"expression_rate" "direction" "right" "terminator_id"  "name"    "promoter_id" "degradation_rate" "id"       "gene_id"                                   "left"
0.0               "f"         310     ["TERM0-1059"]   "thrL"    "PM00249"     0.198905992329492 "TU0-42486" ["EG11277"]                                  148
657.057317358791  "f"         5022    ["TERM_WC-2174"] "thrLABC" "PM00249"     0.231049060186648 "TU00178"   ["EG10998", "EG10999", "EG11000", "EG11277"] 148
```
- promoter_id: matches promoter id in reconstruction/ecoli/flat/promoters.tsv
- gene_id: matches id in reconstruction/ecoli/flat/genes.tsv
- id: matches exactly those used in BioCyc, which is quite nice, might be more or less standardized:
  - biocyc.org/ECOLI/NEW-IMAGE?object=TU0-42486
  - biocyc.org/ECOLI/NEW-IMAGE?type=OPERON&object=TU00178

reconstruction/ecoli/flat/genes.tsv

"length" "name"                      "seq"             "rnaId"      "coordinate" "direction" "symbol" "type" "id"      "monomerId"
66       "thr operon leader peptide" "ATGAAACGCATT..." "EG11277_RNA" 189         "+"         "thrL"   "mRNA" "EG11277" "EG11277-MONOMER"
2463     "ThrA"                      "ATGCGAGTGTTG"    "EG10998_RNA" 336         "+"         "thrA"   "mRNA" "EG10998" "ASPKINIHOMOSERDEHYDROGI-MONOMER"

reconstruction/ecoli/flat/metabolites.tsv contains metabolite information. Sample lines:
```
"id"                       "mw7.2" "location"
"HOMO-SER"                 119.12  ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
"L-ASPARTATE-SEMIALDEHYDE" 117.104 ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
```
In the case of the enzyme thrA, one of the two reactions it catalyzes is "L-aspartate 4-semialdehyde" into "Homoserine".
Starting from the enzyme page: biocyc.org/gene?orgid=ECOLI&id=EG10998 we reach the reaction page: biocyc.org/ECOLI/NEW-IMAGE?type=REACTION&object=HOMOSERDEHYDROG-RXN which has reaction ID HOMOSERDEHYDROG-RXN, and that page which clarifies the IDs:
- biocyc.org/compound?orgid=ECOLI&id=L-ASPARTATE-SEMIALDEHYDE: "L-aspartate 4-semialdehyde" has ID L-ASPARTATE-SEMIALDEHYDE
- biocyc.org/compound?orgid=ECOLI&id=HOMO-SER: "Homoserine" has ID HOMO-SER
so these are the compounds that we care about.

reconstruction/ecoli/flat/reactions.tsv contains chemical reaction information. Sample lines:

"reaction id" "stoichiometry" "is reversible" "catalyzed by"

"HOMOSERDEHYDROG-RXN-HOMO-SER/NAD//L-ASPARTATE-SEMIALDEHYDE/NADH/PROTON.51."
  {"NADH[c]": -1, "PROTON[c]": -1, "HOMO-SER[c]": 1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "NAD[c]": 1}
  false
  ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]

"HOMOSERDEHYDROG-RXN-HOMO-SER/NADP//L-ASPARTATE-SEMIALDEHYDE/NADPH/PROTON.53."
  {"NADPH[c]": -1, "NADP[c]": 1, "PROTON[c]": -1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "HOMO-SER[c]": 1
  false
  ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]

catalized by: here we see ASPKINIHOMOSERDEHYDROGI-CPLX, which we can guess is a protein complex made out of ASPKINIHOMOSERDEHYDROGI-MONOMER, which is the ID for the thrA we care about! This is confirmed in complexationReactions.tsv.

reconstruction/ecoli/flat/complexationReactions.tsv contains information about chemical reactions that produce protein complexes:
```
"process" "stoichiometry" "id" "dir"
"complexation"
  [
    {
      "molecule": "ASPKINIHOMOSERDEHYDROGI-CPLX",
      "coeff": 1,
      "type": "proteincomplex",
      "location": "c",
      "form": "mature"
    },
    {
      "molecule": "ASPKINIHOMOSERDEHYDROGI-MONOMER",
      "coeff": -4,
      "type": "proteinmonomer",
      "location": "c",
      "form": "mature"
    }
  ]
"ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
1
```
The coeff is how many monomers need to get together for form the final complex. This can be seen from the Summary section of ecocyc.org/gene?orgid=ECOLI&id=ASPKINIHOMOSERDEHYDROGI-MONOMER:
Aspartate kinase I / homoserine dehydrogenase I comprises a dimer of ThrA dimers. Although the dimeric form is catalytically active, the binding equilibrium dramatically favors the tetrameric form. The aspartate kinase and homoserine dehydrogenase activities of each ThrA monomer are catalyzed by independent domains connected by a linker region.
Fantastic literature summary! Can't find that in database form there however.

reconstruction/ecoli/flat/proteinComplexes.tsv contains protein complex information:

"name" "comments" "mw" "location" "reactionId" "id"
"aspartate kinase / homoserine dehydrogenase"
""
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 356414.04399999994, 0.0, 0.0, 0.0, 0.0]
["c"]
"ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
"ASPKINIHOMOSERDEHYDROGI-CPLX"

reconstruction/ecoli/flat/protein_half_lives.tsv contains the half-life of proteins. Very few proteins are listed however for some reason.

reconstruction/ecoli/flat/tfIds.csv: transcription factors information:

"TF"   "geneId"  "oneComponentId"  "twoComponentId" "nonMetaboliteBindingId" "activeId" "notes"
"arcA" "EG10061" "PHOSPHO-ARCA"    "PHOSPHO-ARCA"
"fnr"  "EG10325" "FNR-4FE-4S-CPLX" "FNR-4FE-4S-CPLX"
"dksA" "EG10230"

5.1. Condition

Words: 471

reconstruction/ecoli/flat/condition/nutrient/minimal.tsv contains the nutrients in a minimal environment in which the cell survives:

"molecule id" "lower bound (units.mmol / units.g / units.h)" "upper bound (units.mmol / units.g / units.h)"
"ADP[c]" 3.15 3.15
"PI[c]" 3.15 3.15
"PROTON[c]" 3.15 3.15
"GLC[p]" NaN 20
"OXYGEN-MOLECULE[p]" NaN NaN
"AMMONIUM[c]" NaN NaN
"PI[p]" NaN NaN
"K+[p]" NaN NaN
"SULFATE[p]" NaN NaN
"FE+2[p]" NaN NaN
"CA+2[p]" NaN NaN
"CL-[p]" NaN NaN
"CO+2[p]" NaN NaN
"MG+2[p]" NaN NaN
"MN+2[p]" NaN NaN
"NI+2[p]" NaN NaN
"ZN+2[p]" NaN NaN
"WATER[p]" NaN NaN
"CARBON-DIOXIDE[p]" NaN NaN
"CPD0-1958[p]" NaN NaN
"L-SELENOCYSTEINE[c]" NaN NaN
"GLC-D-LACTONE[c]" NaN NaN
"CYTOSINE[c]" NaN NaN

If we compare that to reconstruction/ecoli/flat/condition/nutrient/minimal_plus_amino_acids.tsv, we see that it adds the 20 amino acids on top of the minimal condition:

"L-ALPHA-ALANINE[p]" NaN NaN
"ARG[p]" NaN NaN
"ASN[p]" NaN NaN
"L-ASPARTATE[p]" NaN NaN
"CYS[p]" NaN NaN
"GLT[p]" NaN NaN
"GLN[p]" NaN NaN
"GLY[p]" NaN NaN
"HIS[p]" NaN NaN
"ILE[p]" NaN NaN
"LEU[p]" NaN NaN
"LYS[p]" NaN NaN
"MET[p]" NaN NaN
"PHE[p]" NaN NaN
"PRO[p]" NaN NaN
"SER[p]" NaN NaN
"THR[p]" NaN NaN
"TRP[p]" NaN NaN
"TYR[p]" NaN NaN
"L-SELENOCYSTEINE[c]" NaN NaN
"VAL[p]" NaN NaN

so we guess that NaN in the upper mound likely means infinite.

We can try to understand the less obvious ones:

ADP: TODO
PI: TODO
PROTON[c]: presumably a measure of pH
GLC[p]: glucose, this can be seen by comparing minimal.tsv with minimal_no_glucose.tsv
AMMONIUM: ammonium. This appears to be the primary source of nitrogen atoms for producing amino acids.
CYTOSINE[c]: hmmm, why is external cytosine needed? Weird.

reconstruction/ecoli/flat/reconstruction/ecoli/flat/condition/timeseries/ contains sequences of conditions for each time. For example:
- reconstruction/ecoli/flat/reconstruction/ecoli/flat/condition/timeseries/000000_basal.tsv contains:
  "time (units.s)" "nutrients" 0 "minimal"
  which means just using reconstruction/ecoli/flat/condition/nutrient/minimal.tsv until infinity. That is the default one used by runSim.py, as can be seen from ./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Environment/attributes/nutrientTimeSeriesLabel which contains just 000000_basal.
- reconstruction/ecoli/flat/reconstruction/ecoli/flat/condition/timeseries/000001_cut_glucose.tsv is more interesting and contains:
  "time (units.s)" "nutrients" 0 "minimal" 1200 "minimal_no_glucose"
  so we see that this will shift the conditions half-way to a condition that will eventually kill the bacteria because it will run out of glucose and thus energy!
Timeseries can be selected with --variant nutrientTimeSeries X Y, see also: run variants.
We can use that variant with:
```
VARIANT="condition" FIRST_VARIANT_INDEX=1 LAST_VARIANT_INDEX=1 python runscripts/manual/runSim.py
```
reconstruction/ecoli/flat/condition/condition_defs.tsv contains lines of form:
```
"condition" "nutrients"                "genotype perturbations" "doubling time (units.min)" "active TFs"
"basal"     "minimal"                  {}                       44.0                        []
"no_oxygen" "minimal_minus_oxygen"     {}                       100.0                       []
"with_aa"   "minimal_plus_amino_acids" {}                       25.0                        ["CPLX-125", "MONOMER0-162", "CPLX0-7671", "CPLX0-228", "MONOMER0-155"]
```
- condition refers to entries in reconstruction/ecoli/flat/condition/condition_defs.tsv
- nutrients refers to entries under reconstruction/ecoli/flat/condition/nutrient/, e.g. reconstruction/ecoli/flat/condition/nutrient/minimal.tsv or reconstruction/ecoli/flat/condition/nutrient/minimal_plus_amino_acids.tsv
- genotype perturbations: there aren't any in the file, but this suggests that genotype modifications can also be incorporated here
- doubling time: TODO experimental data? Because this should be a simulation output, right? Or do they cheat and fix doubling by time?
- active TFs: this suggests that they are cheating transcription factors here, as those would ideally be functions of other more basic inputs

6. Model validation

Words: 5

TODO compare with actual datasetes.

7. Publications

Words: 78

Unfortunately, due to lack of one page to rule them all, the on-Git tree publication list is meager, some of the most relevant ones seems to be:

2021 open access review paper: journals.asm.org/doi/full/10.1128/ecosalplus.ESP-0001-2020 "The E. coli Whole-Cell Modeling Project". They should just past that stuff in a README :-) The article mentions that it is a follow up to the previous M. genitalium whole cell model by Covert lab. Only 43% of known genes modelled at this point however, a shame.
2020 under Science paywall: www.science.org/doi/10.1126/science.aav3751 "Simultaneous cross-evaluation of heterogeneous E. coli datasets via mechanistic simulation"