New Blog 2
Background
Traditional DNA storage systems typically rely on phosphoramidite chemistry for base-by-base synthesis, limiting information density to log₂(4) = 2 bits per synthesis cycle. The Molecular Storage System (MoSS) introduces a paradigm shift through motif-based synthesis, enabling significantly higher information density per synthesis cycle while addressing key limitations in current DNA storage architectures.
Traditional DNA Storage: Base-by-Base Encoding
Traditional DNA storage systems typically employ phosphoramidite chemistry with base-by-base encoding:
1. Binary-to-Base Mapping:
- Direct mapping of binary pairs to nucleotides
- Common scheme: 00→A, 01→C, 10→G, 11→T
- Information density: 2 bits per nucleotide (theoretical maximum)
2. Synthesis Constraints:
- Sequential addition of individual nucleotides
- Coupling efficiency: ~99% per base
- Error accumulation: the probability of at least one coupling failure is 1 - 0.99^n for a sequence of n nucleotides
- Maximum practical length: ~200-300 nucleotides
3. Information Capacity:
N (bits) = x × log₂m
where:
x = number of synthesis cycles
m = available nucleotides (typically 4)
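The two formulas above can be checked numerically. The following is a minimal sketch, assuming the ~99% coupling efficiency quoted earlier applies independently to every base; the function names are illustrative and not part of any MoSS tooling.

```python
import math


def capacity_bits(cycles: int, alphabet_size: int = 4) -> float:
    """N = x * log2(m): bits stored by `cycles` synthesis cycles."""
    return cycles * math.log2(alphabet_size)


def error_free_fraction(length: int, coupling_efficiency: float = 0.99) -> float:
    """Fraction of strands with no coupling failure after `length` additions."""
    return coupling_efficiency ** length


for n in (100, 200, 300):
    ok = error_free_fraction(n)
    print(f"n={n}: capacity={capacity_bits(n):.0f} bits, "
          f"error-free={ok:.1%}, at least one failure={1 - ok:.1%}")
```

Under these assumptions only about 13% of 200-nucleotide strands are free of coupling failures, which is why base-by-base synthesis becomes impractical beyond a few hundred nucleotides.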
Composite Motif Architectural Overview
The fundamental innovation in MoSS lies in its motif-based encoding scheme:
- Base Motif Library:
- 8 distinct 25-mer DNA sequences, carefully selected to:
- Minimise cross-hybridisation
- Prevent secondary structure formation
- Reduce de-novo synthesis errors
- Maintain specific GC content requirements
Fundamental Unit Structure of Motif
[13nt spacer]-[25nt motif]-[12nt spacer]
Total length per unit: 50 nucleotides
- Composite Motif Formation:
- Each composite motif comprises exactly 4 motifs from the 8-motif set
- Total possible combinations: C(8,4) = 70 unique composite motifs
- Theoretical information capacity: ⌊log₂(70)⌋ = 6 bits per composite motif
Information Density Calculation
Bits per composite motif = log₂(C(8,4))
Bits per synthesis cycle = log₂(70) ≈ 6.13 bits
Practical implementation: 6 bits/cycle
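The figures in this calculation can be reproduced with Python's standard library. This is only a sketch of the arithmetic; the constant names are chosen here for illustration.

```python
from itertools import combinations
from math import comb, floor, log2

MOTIF_SET_SIZE = 8        # motifs in the base library
MOTIFS_PER_COMPOSITE = 4  # motifs combined into one composite motif

# Enumerate every unordered selection of 4 motifs out of 8.
composites = list(combinations(range(1, MOTIF_SET_SIZE + 1), MOTIFS_PER_COMPOSITE))
n_composites = comb(MOTIF_SET_SIZE, MOTIFS_PER_COMPOSITE)

bits_theoretical = log2(n_composites)  # information per composite motif
bits_usable = floor(bits_theoretical)  # whole bits usable in practice

print(f"{n_composites} composite motifs, "
      f"{bits_theoretical:.2f} bits theoretical, {bits_usable} bits usable")
```

Rounding down to 6 bits means only 64 of the 70 composite motifs are needed for a fixed-width binary mapping.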
2. Oligonucleotide Architecture
- Precise structural organisation per oligonucleotide:
[Outer Barcode] - [5' Primer] - [Address1] - [Address2] - [Payload1-8] - [3' Primer] - [Outer Barcode]
- Detailed cycle structure:
- Cycle 0: 5' primer flank
- Cycles 1-2: Address regions (50 nucleotides each)
- Cycles 3-10: Payload regions (50 nucleotides each)
- Cycle 11: 3' primer flank
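The cycle layout above can be written down as a small data structure. The sketch below assumes that every payload cycle carries one composite motif at the practical rate of 6 bits; primer flank and outer barcode lengths are not stated in this document, so they are left out of the length calculation.

```python
from dataclasses import dataclass

SPACER_5_NT, MOTIF_NT, SPACER_3_NT = 13, 25, 12  # unit structure, in nucleotides
UNIT_NT = SPACER_5_NT + MOTIF_NT + SPACER_3_NT   # 50 nt per address/payload unit
BITS_PER_PAYLOAD_CYCLE = 6                       # assumed: one composite motif per cycle


@dataclass(frozen=True)
class Cycle:
    index: int
    role: str


def oligo_cycles() -> list[Cycle]:
    """Return the 12 synthesis cycles of one CM oligo, in order."""
    roles = {0: "5' primer flank", 11: "3' primer flank"}
    roles.update({i: "address" for i in (1, 2)})
    roles.update({i: "payload" for i in range(3, 11)})
    return [Cycle(i, roles[i]) for i in range(12)]


cycles = oligo_cycles()
payload_cycles = [c for c in cycles if c.role == "payload"]
payload_bits = len(payload_cycles) * BITS_PER_PAYLOAD_CYCLE
unit_region_nt = sum(UNIT_NT for c in cycles if c.role in ("address", "payload"))

print(f"{len(cycles)} cycles, {payload_bits} payload bits per oligo, "
      f"{unit_region_nt} nt of address and payload units")
```

Under these assumptions one CM oligo carries 48 payload bits across 500 nucleotides of address and payload units.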
MoSS DNA Storage Workflow
The MoSS workflow enables efficient storage and retrieval of digital data using DNA as a medium. This process involves three main stages: Code, Write, and Read.
- Code: The digital files are first converted into binary form. Users apply their own error-correction coding, producing a binary file with additional bits that facilitate accurate file reconstruction. Helixworks’ composite motif encoding system then maps these binary files to specific sequences of DNA oligos, enabling reliable and compact digital data representation in DNA.
Key Points
- Digital files are converted into binary form
- Users can apply their own error-correction coding
- Composite motif encoding system maps binary files to specific DNA oligo sequences
- Write: The encoded DNA oligos are synthesised and pooled. Each pool consists of 64 Composite Motif oligos, with each CM oligo containing a distinct segment of the encoded data. A unique outer barcode is appended to each pool of 64 CM oligos, facilitating the creation of a master pool with 6,144 CM oligos per vial. This step effectively translates digital information into a physical DNA pool, storing the data in a molecular format. <aside> <img src="/icons/promoted_gray.svg" alt="/icons/promoted_gray.svg" width="40px" /> Key Points
- Encoded DNA oligos are synthesised and pooled
- Each pool consists of 64 Composite Motif (CM) oligos
- Each CM oligo contains a distinct segment of the encoded data
- Unique outer barcodes are appended to each pool
- Master pools can contain up to 6,144 CM oligos per vial </aside>
- Read: The stored DNA is read using a nanopore-based sequencer, such as the MinION. This device captures the DNA sequences as they pass through active pores, enabling real-time processing of the resulting DNA sequences and the retrieval of the encoded composite motifs. <aside> <img src="/icons/promoted_gray.svg" alt="/icons/promoted_gray.svg" width="40px" /> Key Points
- Stored DNA is read using nanopore-based sequencing (MinION)
- Real-time processing of sequenced reads
- Retrieval of encoded composite motifs </aside>
Composite Motif Encoding: Mapping Binary Data to DNA Oligos
The Composite Motif Encoding process converts binary data into DNA oligos for molecular data storage. This encoding approach allows digital data to be translated into DNA sequences using unique combinations of DNA motifs, referred to as composite motifs.
- Binary File Preparation: The process begins with a binary file that represents digital information. Users apply error-correction coding to ensure data integrity, resulting in a binary file with error-correction bits. This file is then divided into manageable chunks and segments.
- Chunking and Segmentation: Each binary file is split into chunks based on the data capacity of a single Composite Motif (CM) oligo. Each chunk is then divided into segments, with each segment representing a specific number of bits. Each segment is mapped to a unique composite motif, which is made up of a combination of 4 motifs selected from a set of 8 available motifs.
- Codeword Dictionary: The Codeword Dictionary provides a mapping between binary sequences and composite motifs. Each composite motif is composed of a unique combination of 4 motifs selected from a total set of 8, yielding 70 unique composite motifs. This dictionary allows each binary sequence to correspond with a specific composite motif, enabling user-defined mappings that suit various data encoding needs.
- Binary-to-Motif Mapping: For each binary sequence in a chunk and its segments, the Codeword Dictionary is referenced to determine the matching composite motif. The flexibility of this approach allows users to control how binary sequences are assigned to composite motifs, as long as two conditions hold: the set consists of 8 motifs, and each composite motif includes exactly 4 motifs. The first entries of an example dictionary are shown below.
S.No | Composite Motif | Binary Sequence |
1 | [1,2,3,4] | 000001 |
2 | [1,2,3,5] | 000010 |
3 | [1,2,3,6] | 000011 |
4 | [1,2,3,7] | 000100 |
5 | [1,2,3,8] | 000101 |
6 | [1,2,4,5] | 000110 |
7 | [1,2,4,6] | 000111 |
8 | [1,2,4,7] | 001000 |
9 | [1,2,4,8] | 001001 |
10 | [1,2,5,6] | 001010 |
11 | [1,2,5,7] | 001011 |
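The rows above follow a simple pattern: composite motifs are listed in lexicographic order, and each is paired with its serial number written as a 6-bit binary string. The sketch below reproduces that pattern and uses it to encode one chunk. It rests on two assumptions that the excerpt does not confirm: that the pattern continues beyond row 11, and that a chunk holds 8 segments of 6 bits (one per payload cycle). The codeword 000000 does not appear in the pattern, so it is left unassigned here; because mappings are user-defined, a real dictionary would need to cover it.

```python
from itertools import combinations

SEGMENT_BITS = 6                             # bits per composite motif
PAYLOAD_CYCLES = 8                           # payload cycles 3-10 per CM oligo
CHUNK_BITS = SEGMENT_BITS * PAYLOAD_CYCLES   # assumed chunk size: 48 bits


def build_dictionary() -> dict[str, tuple[int, ...]]:
    """Map 6-bit codewords to composite motifs, following the excerpt's pattern."""
    table = {}
    combos = combinations(range(1, 9), 4)    # lexicographic order, 70 in total
    for serial, combo in enumerate(combos, start=1):
        if serial >= 2 ** SEGMENT_BITS:      # only 000001..111111 fit in 6 bits
            break
        table[format(serial, "06b")] = combo
    return table


def encode_chunk(bits: str, table: dict[str, tuple[int, ...]]) -> list[tuple[int, ...]]:
    """Split one chunk into 6-bit segments and look up each composite motif."""
    if len(bits) != CHUNK_BITS:
        raise ValueError(f"expected {CHUNK_BITS} bits, got {len(bits)}")
    segments = [bits[i:i + SEGMENT_BITS] for i in range(0, CHUNK_BITS, SEGMENT_BITS)]
    return [table[s] for s in segments]      # KeyError for unmapped codewords


table = build_dictionary()
demo_chunk = "".join(format(i, "06b") for i in range(1, 9))  # segments 000001..001000
for cycle, motif in enumerate(encode_chunk(demo_chunk, table), start=3):
    print(f"cycle {cycle}: {list(motif)}")
```

Any other assignment of codewords to composite motifs works equally well, provided the same dictionary is used for decoding.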
Writing Oligos: Automated Synthesis Workflow
The Writing Oligos process is the core of converting digital data into DNA sequences by synthesising Composite Motif (CM) encoded oligos in a structured and automated workflow. This process creates precise DNA pools that store encoded information with high fidelity.
- DNA Parts for CM Oligo Synthesis: Each CM oligo is constructed from several distinct DNA parts, each serving a specific function:
- Outer Barcodes: Outer barcodes provide unique identifiers for each oligo pool. Each pool is labelled with one of up to 96 available barcodes, enabling the creation of a master pool with up to 6,144 distinct chunks in a single vial. The combination of outer barcodes and address motifs determines the total number of chunks in each master vial.
- Address Motifs: Incorporated in cycles 1 and 2, address motifs define the specific location of each oligo within the sequence pool, facilitating accurate retrieval and identification of individual data segments. Paired with the outer barcodes, these address motifs allow data to be chunked and encoded into CM oligos.
- Primer Flanks: These are added during cycles 0 and 11 to enable amplification and sequencing, ensuring that each CM oligo can be read and decoded.
- Payload Motifs: These motifs, incorporated from cycles 3 to 10, carry the actual data payload. Each payload motif represents a segment of the binary data, with 64 possible payload motifs allowing for diverse data encoding combinations.
- Automated Synthesis Workflow:
- Source Plate: The process begins with a source plate containing 93 distinct DNA parts, including payload motifs, primer flanks, and address motifs.
- Synthesis Plate: Using automated synthesis, 79 DNA parts are combined per well on the synthesis plate to create 64 CM oligos per plate. This setup supports precise, high-throughput synthesis for data-encoded oligos.
- Outer Barcoding: Following synthesis, outer barcodes are appended to each well containing 64 CM oligos, with 96 unique barcodes per plate. This barcoding step ensures each 64-well pool is distinctly organised. Each oligo pool contains 64 chunks uniquely identified by a single outer barcode, organising data into manageable units. By leveraging up to 96 outer barcodes, a master pool can accommodate 6,144 chunks in a single vial.
- DNA Pool Creation: Once barcoded, the oligos are collected to form a master DNA pool, with each master pool comprising up to 6,144 CM oligos. This pool effectively represents a fully encoded file in a molecular format, ready for storage and future retrieval.
Live Motif Search System
The Live Motif Search system enables real-time processing and decoding of DNA-stored data through a streamlined workflow. This integrated system combines hardware and software components to efficiently recover digital information from sequenced DNA.
- Data Processing Pipeline
- The process begins when the MinION sequencer generates raw signal data and stores it in FAST5/POD5 files. This initial data capture occurs in real time as molecules pass through the nanopores of the sequencing device. The system continuously monitors the output directory for new files. Each time a new FAST5/POD5 file is generated, the pipeline activates automatically and begins processing.
- Raw electrical signals are converted into nucleotide sequences through ONT’s basecalling algorithms. The output is provided in FASTQ format, which includes both the sequence data and quality scores for each base call. This step transforms the electrical measurements into readable nucleotide code. Reads with Q>9 are passed to the next step for downstream analysis and decoding.
- The system processes the sequence data in three stages. First, it demultiplexes the outer barcodes to separate the data from the different pools of 64 CM oligos. Next, it identifies address motifs to determine the position and organisation of chunks. Finally, it runs SeqKit’s locate function to find and match the payload motifs within each read.
- Based on the motifs identified within the chunks, the Composite Motifs are decoded, with the output provided as composite motif (CM) files that can be used along with the binary mapping to retrieve the digital information.
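The quality gate in the pipeline (reads with Q>9 move on to motif search) can be illustrated with a few lines of Python. This is a sketch only: the document does not say how the per-read Q value is computed, so the version below assumes it is derived from the mean per-base error probability with Phred+33 encoding. The FASTQ parser is a simplified stand-in that expects four lines per record.

```python
import math
from typing import Iterator


def parse_fastq(text: str) -> Iterator[tuple[str, str, str]]:
    """Yield (read name, sequence, quality string) from 4-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:].split()[0], lines[i + 1], lines[i + 3]


def mean_qscore(quality: str, offset: int = 33) -> float:
    """Per-read Q score from the mean per-base error probability (assumed method)."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def passing_reads(text: str, min_q: float = 9.0) -> list[tuple[str, str]]:
    """Keep reads whose mean Q score is strictly above `min_q`."""
    return [(name, seq) for name, seq, qual in parse_fastq(text)
            if mean_qscore(qual) > min_q]


# Hypothetical two-read FASTQ: 'I' encodes Q40, '&' encodes Q5.
DEMO_FASTQ = "@read_hi\nACGT\n+\nIIII\n@read_lo\nACGT\n+\n&&&&\n"
for name, seq in passing_reads(DEMO_FASTQ):
    print(name, seq)
```

In the demo input, the Q40 read passes the filter and the Q5 read is dropped.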
- Exact Motif Matching with SeqKit
- SeqKit’s locate function is employed to perform precise substring matching, identifying instances where reference motifs appear within the query sequences. This function searches for 100% matched substrings, ensuring that only exact matches of reference motifs are identified in the sequencing reads.
- The locate function scans for all possible motifs, covering primer motifs, address cycle motifs, payload cycle motifs, and all potential splint cycle combinations. These include both expected and unexpected splint cycle combinations.
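To make the idea of exact matching concrete, here is a plain-Python stand-in for 100% substring matching. It is not SeqKit and does not reproduce its options or output format; the motif names and sequences are hypothetical and much shorter than real 25-mer motifs. Positions are reported 1-based and inclusive, which is a choice made for this sketch.

```python
def locate_exact(read: str, motifs: dict[str, str]) -> list[tuple[str, int, int]]:
    """Return (motif name, start, end) for every exact occurrence in `read`."""
    hits = []
    for name, motif in motifs.items():
        start = read.find(motif)
        while start != -1:
            hits.append((name, start + 1, start + len(motif)))  # 1-based, inclusive
            start = read.find(motif, start + 1)                 # allow overlapping hits
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical short motifs, for illustration only.
MOTIFS = {"addr_A": "ACGTAC", "pay_B": "TTGACC"}

clean_read = "GGACGTACAAATTGACCT"
mutated_read = "GGACGTTCAAATTGACCT"  # one substitution inside addr_A

print(locate_exact(clean_read, MOTIFS))
print(locate_exact(mutated_read, MOTIFS))
```

A single substitution is enough to remove a motif from the results, which is the strictness that exact matching implies.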
- Expected vs. Unexpected Splint Cycle Combinations
- Expected Splint Cycle Combinations: These are the designed, sequential pairings of cycle motifs, such as cycle 1 followed by cycle 2, cycle 2 followed by cycle 3, and so forth, adhering to the intended order in the synthesis and encoding process.
- Unexpected Splint Cycle Combinations: Occasionally, due to faulty hybridisation or ligation, motifs may combine out of sequence, forming non-sequential pairings like cycle 9 followed by cycle 1. These unexpected combinations are detected during the search process, highlighting potential errors or anomalies in the synthesis process. Typically, these unexpected combinations represent only 1-3% of all splint cycle combinations.
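The distinction between expected and unexpected combinations reduces to checking whether adjacent cycles in a read are consecutive. The sketch below assumes a read has already been reduced to the ordered list of cycle numbers found in it; that representation and the function names are hypothetical.

```python
from collections import Counter


def classify_junctions(cycle_order: list[int]) -> Counter:
    """Count expected (n followed by n+1) and unexpected cycle junctions in one read."""
    counts = Counter()
    for left, right in zip(cycle_order, cycle_order[1:]):
        counts["expected" if right == left + 1 else "unexpected"] += 1
    return counts


def unexpected_fraction(reads: list[list[int]]) -> float:
    """Share of unexpected junctions across all reads."""
    total = Counter()
    for order in reads:
        total.update(classify_junctions(order))
    n = total["expected"] + total["unexpected"]
    return total["unexpected"] / n if n else 0.0


demo_reads = [
    list(range(12)),         # fully ordered oligo: cycles 0..11
    [0, 1, 2, 9, 1, 2, 3],   # contains the junctions 2->9 and 9->1
]
print(classify_junctions(demo_reads[1]))
print(f"unexpected fraction: {unexpected_fraction(demo_reads):.1%}")
```

Tracking this fraction per run gives a simple indicator to compare against the typical 1-3% range.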
Project Scope
The primary objective is to synthesise 640 composite motif oligos from user-provided binary-to-motif mappings.
<aside> <img src="/icons/promoted_gray.svg" alt="/icons/promoted_gray.svg" width="40px" />
Key Points
- Total synthesis: 640 CM oligos
- Organised into 10 pools of 64 CM oligos each
- Each pool uniquely barcoded with outer barcodes </aside>
Oligo Design Structure
- Outer Barcode: Unique pool identifier for a set of 64 CM oligos
- Primer Flanks: Added during cycles 0 and 11 for amplification and sequencing
- Address Motifs: Cycles 1-2, defining oligo location within sequence pool
- Payload Motifs: Cycles 3-10, carrying actual data payload
- Total Structure: 12 cycles per oligonucleotide
Capacity and Organisation
- 64 CM oligos per pool provided in individual vials
- 96 unique outer barcodes, one per 64 CM oligo pool
- 64 distinct chunks encoded in oligos per vial
- 6,144 distinct chunks encoded per master vial
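These capacity figures follow from simple multiplication, sketched below. The payload size per oligo (8 payload cycles at 6 bits each, 48 bits) is an assumption derived from the encoding scheme rather than a stated specification, and it counts any user-applied error-correction bits as payload.

```python
import math

OLIGOS_PER_POOL = 64         # CM oligos (chunks) per outer barcode
OUTER_BARCODES = 96          # maximum barcodes per master vial
PAYLOAD_BITS_PER_OLIGO = 48  # assumed: 8 payload cycles x 6 bits


def master_vial_chunks(barcodes_used: int = OUTER_BARCODES) -> int:
    """Distinct chunks in a master vial built from `barcodes_used` pools."""
    return OLIGOS_PER_POOL * barcodes_used


def pools_needed(total_oligos: int) -> int:
    """Number of 64-oligo pools required for `total_oligos` CM oligos."""
    return math.ceil(total_oligos / OLIGOS_PER_POOL)


chunks = master_vial_chunks()
payload_bytes = chunks * PAYLOAD_BITS_PER_OLIGO // 8

print(f"master vial: {chunks} chunks, about {payload_bytes} bytes of payload")
print(f"640 CM oligos -> {pools_needed(640)} pools")
```

For the 640 CM oligos in this project, the same arithmetic gives 10 pools, well within the 96 available outer barcodes.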
Deliverables
- Individual vials, each containing 64 CM oligos
- Shipped with ice packs via courier service
- Storage recommendations provided
- Sequencing data, including *.fastq files, demultiplexing reports, and motif-search reports
- Digital QC & quantification reports for each 64 CM pool
- Outer barcode & motif sequence files
- Technical documentation for sequencing preparation
Next Steps
To proceed with the synthesis of 640 Composite Motif oligos, we kindly ask you to take the following steps:
- Confirmation of Collaboration: Please confirm your interest in partnering with Helixworks on this project. An affirmative response via email will enable us to promptly initiate the necessary preparations.
- SoW Agreement & Purchase Order: Upon confirmation, we will provide a Scope of Work Agreement outlining all project details and deliverables for your review and signature. After receiving your Purchase Order, Helixworks will issue a Sales Order for the materials and services specified. Any additional work requested will be invoiced separately at the rates outlined in the agreement.
- File Sharing & Timeline Confirmation: Once the agreement is in place, we will provide template files and instructions for sharing the oligo designs. Upon receipt of these files, we will confirm the estimated shipping date.
Helixworks is excited about this collaboration and looks forward to working closely with the team. Should you have any questions, require further information, or wish to discuss any specific aspects of the proposal, please do not hesitate to reach out.