Sia is a network for remotely storing data. This is typically called 'cloud storage': the core feature is that you can put data on the network and it will be available from anywhere in the world at a later date. Putting data onto a network means that someone else - a 'host' - is going to be storing the data, and is going to be responsible for returning the data when requested. Sia makes several key assumptions about the network:
Hosts cannot be trusted - if they are able, they will spy, steal, and cheat. Strong mechanisms must be used to discourage and prevent malice.
Hosts are not charitable - hosts need to be paid, especially if the data is private or is large in volume. Payment must be guaranteed.
Hosts are unstable - a single host, and even a group of hosts, is liable to go offline even with a history of 100% reliability.
The network is hostile - if there is a way to be abusive, someone will discover it and cause abuse.
Sia is able to safely store data on a network that has the above properties. There are three core strategies employed by Sia to ensure the safety of data. The first is encryption, which serves to protect the privacy of the data even when the hosts are trying to view the data. All data on Sia is encrypted before it is ever sent over the network, and it is only decrypted after it has been downloaded. The hosts will never be able to view decrypted data. The second strategy is redundancy. Data is not given to one or two or three hosts, but instead a myriad of hosts. Using erasure coding techniques such as Reed-Solomon coding, a high reliability can be achieved even without a high redundancy. The final strategy is to align the incentives of the hosts by paying them only if they store the data, but also by guaranteeing that they will get paid for storing the data, even if the renter is not online to make the payment. This can be achieved using a file contract, and a file contract can be achieved using a blockchain.
A file contract is an agreement between a renter and a host. The renter agrees to pay the host for storing a file, and the host agrees to store the file for a certain period of time. The renter and the host both put money into the file contract at the beginning. The money from the renter will be payment for the host after the contract is fulfilled. The money from the host is collateral that the host will forfeit if the contract is not completed. The file contract goes onto the blockchain, which will serve as escrow. When the file contract is over, the host must provide a proof of storage to the blockchain proving that the file is still being stored. After the proof of storage is provided, the host's collateral is returned and the renter's payment is made to the host. If the proof of storage is not provided in time, the money is forfeit.
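The escrow logic described above can be sketched in a few lines. This is a minimal illustration, not Sia's actual contract code: the `FileContract` class, its field names, and the `settle` function are hypothetical, and the sketch only records that forfeited money leaves both parties, since the text does not say where it goes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FileContract:
    renter_payment: int   # coins the renter locks up as payment
    host_collateral: int  # coins the host locks up as collateral
    expiry_height: int    # hypothetical: block height by which the proof is due


def settle(contract: FileContract, proof_height: Optional[int]) -> Tuple[int, int]:
    """Return (coins_to_host, coins_forfeited) once the contract is over.

    proof_height is the block height at which a valid proof of storage
    appeared on the blockchain, or None if no proof was ever submitted.
    """
    total = contract.renter_payment + contract.host_collateral
    if proof_height is not None and proof_height <= contract.expiry_height:
        # Proof arrived in time: collateral is returned and the payment is made.
        return total, 0
    # Proof missing or late: the money is forfeit.
    return 0, total


contract = FileContract(renter_payment=100, host_collateral=50, expiry_height=1000)
print(settle(contract, proof_height=990))
print(settle(contract, proof_height=None))
```

Because the host loses its own collateral on failure, dropping the file costs more than simply missing out on the payment.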
The file contract is enough to provide a strong incentive for the host to keep the file. Keeping the file provides a financial income, and losing the file results in a financial penalty. For the act of storing data, a combination of encryption and the file contract covers the first two bullet points (hosts cannot be trusted and are not charitable). There is still no guarantee that the host will not hold the data hostage; protections against this are discussed later.
The renter does not want to rely on a single host, even with all of the financial incentives and commitments in place. The inescapable truth is that a single host is always at risk of unexpected downtime or failure (even if the host is trustworthy). This risk can be minimized by storing the data on multiple hosts. If the full data is stored on 3 hosts, then all 3 hosts would need to go offline simultaneously in order for the data to be lost. It turns out, however, that we can do substantially better than 1-of-3. Reed-Solomon coding provides a way to store data such that any M of N hosts can be used to recover the data, and the redundancy is only N/M (which is theoretically optimal). Instead of 1-of-3, we can do 10-of-30 for the same redundancy of 3x. Switching to 10-of-30 gives us enormous reliability benefits - the chance that 21 hosts out of 30 fail is substantially lower than the chance that 3 out of 3 hosts fail. It turns out that if your hosts have 95% uptime, a 10-of-30 scheme provides a file uptime exceeding 99.999999999%. This math does assume that the hosts fail independently, but carefully selecting hosts by region should provide reasonable independence. If the hosts have 98% uptime (allowing for roughly 30 minutes of downtime every day, or about 15 hours every month), the file can hit 99.999999999% with an 18-of-30 scheme, which is only 1.67x redundancy. This allows for incredible cost savings, greater insulation against attackers, and provides a large pool of hosts that can be used to download files with high parallelism.
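These reliability figures can be checked with a short binomial calculation. The sketch below assumes, as the text does, that hosts fail independently; the function name `file_failure_probability` is made up for illustration. It computes the failure probability directly rather than the uptime, because the uptime is so close to 1 that a floating-point number cannot represent the difference.

```python
from math import comb


def file_failure_probability(m: int, n: int, host_uptime: float) -> float:
    """Probability that fewer than m of n independent hosts are online.

    A file stored with an m-of-n scheme is unavailable exactly when this
    happens, so file uptime is 1 minus the returned value.
    """
    p = host_uptime
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m))


# Eleven nines of file uptime means a failure probability below 1e-11.
TARGET = 1e-11

for m, n, uptime in [(1, 3, 0.95), (10, 30, 0.95), (18, 30, 0.98)]:
    failure = file_failure_probability(m, n, uptime)
    print(f"{m}-of-{n} at {uptime:.0%} host uptime: "
          f"failure probability {failure:.3e}, meets target: {failure < TARGET}")
```

The 1-of-3 case is simply 0.05 cubed, while both erasure-coded schemes fall well below the target despite using the same or less redundancy.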
This redundancy also provides insulation against hosts who may try to hold data hostage. In a 10-of-30 scheme, you only need 10 hosts to recover your data. Downloads on Sia are paid, which means a host gets revenue every time you download data from them. If 1 or even 15 hosts are malicious and try to hold data hostage, they can be fully ignored and the non-malicious hosts can be used instead. This has a direct opportunity cost for the malicious hosts - they lose revenue from the downloads. Furthermore, renters will blacklist hosts if their download prices are consistently too high. The combined pressures of being unlikely to succeed, losing out on immediate revenue, and losing out on future revenue (in the form of future uploads and downloads) mean that hosts are unlikely to perform this attack (and even if they do, it's not a big deal - just ignore them and use the honest hosts). Highly paranoid renters can get further protection by using a 3-of-30 or even something like a 2-of-100 scheme (which has high redundancy overhead) to protect their most sensitive files. In all likelihood though, 10-of-30 is already sufficient even for the most sensitive files. In practice, we've seen files maintain perfect reliability even during buggy prototype releases where average host reliability was below 50%.
Renters continuously observe the blockchain and the network to verify the uptime and reliability of hosts. Renters have a strong preference for hosts that are reliable, fast, and low-cost. Additionally, the renter typically only ever uploads to a small percentage of the total number of hosts. This creates a heavy pressure on hosts to perform better. The exact algorithm is still being determined.
At this point we've covered our bases for 3 of the original points (untrustworthy hosts, non-charitable hosts, and unstable hosts), both for uploading data and for being able to retrieve it. The vast majority of Sia is heavily protected against malicious attackers. The Sia blockchain very closely resembles the Bitcoin blockchain, preserving the Proof-of-Work consensus mechanism, preserving the 10-minute block times, and in general copying the Bitcoin blockchain wherever possible. A few well-known bugs (such as transaction malleability) have been fixed, but otherwise the design decisions of the Sia blockchain match the Bitcoin blockchain as much as possible. A strong form of encryption is used (Twofish with 256-bit keys), and all protocols in Sia assume that the other party is going to start behaving maliciously at any moment.
There is only one significant remaining problem, which is host selection. Renters are expected to choose their own hosts, and an attacker can attempt to manipulate the renter's selection criteria in a number of ways, including by setting the price really low and by performing a Sybil attack. A Sybil attack is an attack where a single person (the attacker) pretends to be many. Online, a single person can fairly easily pretend to be 10 or even 10,000. In Sia, this means that an attacker might be able to spin up 10,000 machines each pretending to be an honest host, and then take advantage of renters.
A key part of Sia's approach to stopping Sybil attackers is proof-of-burn. Hosts burn coins by sending them to a provably unspendable address. Hosts are expected to burn a portion of their revenue (~4%) as a demonstration that they are real. Renters will select hosts that have burned coins with a probability that grows linearly with the total number of coins burned. Therefore, a host that has burned 2x as many coins will be twice as likely to be selected as another host that is otherwise identical. This provides a very important defence against Sybil attacks. An attacker that is trying to manipulate a renter will need to hold all of the excess redundancy of a file before being able to commit an attack. For a file with 3x redundancy, that means the attacker will need to hold more than 2x of that 3x, which means that the attacker will need to burn enough coins to look like more than 67% of the network. That entails burning more than 2x as many coins as the rest of the network has burned combined. Especially as the network grows and matures, collecting that many coins is going to be prohibitive. Unlike in proof-of-stake systems, it's not sufficient to just collect those coins; they actually need to be burned, which means there's no chance of recovering that investment. While not wholly infeasible, performing a Sybil attack on Sia should be more expensive than performing a 51% attack on Bitcoin. Even better, paranoid renters can protect themselves more fully by using a higher redundancy. Renters storing at 10x redundancy are safe unless the attacker looks like more than 90% of the network, requiring the attacker to burn more than 9x as many coins as the rest of the network combined to be successful.
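The cost of such an attack can be estimated with a back-of-the-envelope model. The sketch below is an assumption-laden simplification, not Sia's selection code: it supposes that a host's chance of being selected is exactly proportional to coins burned, that all other factors are equal, and that the attacker must hold all of the excess redundancy, i.e. a share of (R - 1) / R of a file stored at redundancy R. The function names are hypothetical.

```python
from fractions import Fraction


def attacker_share_needed(redundancy) -> Fraction:
    """Share of selection weight an attacker must exceed to hold all the
    excess redundancy of a file stored at the given redundancy factor."""
    r = Fraction(redundancy)
    return (r - 1) / r


def burn_multiple(share: Fraction) -> Fraction:
    """Coins the attacker must burn, as a multiple of the coins burned by
    the rest of the network combined, to make up `share` of the total."""
    return share / (1 - share)


for redundancy in (3, 10):
    share = attacker_share_needed(redundancy)
    multiple = burn_multiple(share)
    print(f"{redundancy}x redundancy: attacker needs more than "
          f"{float(share):.0%} of the network, i.e. more than "
          f"{float(multiple):g}x the coins burned by everyone else")
```

Under this model the required burn multiple works out to R - 1, so each extra unit of redundancy raises the attacker's cost by one more copy of the rest of the network's burned coins.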
While the exact algorithm for selection has not been finalized, most of the criteria are understood:
Hosts are given a score, and then selected randomly according to their score.
The score goes up linearly with the number of coins burned - a linear relationship is required to prevent Sybil attacks.
Hosts are heavily penalized if their uptime is below 95%, and there is not a significant advantage to having an uptime greater than 99%. That is because the trust model explicitly chooses to assume that no host is more reliable than 99% - even when the historic reliability is there, the chance of betrayal or malice is always present. 99% reliability across a 20-of-30 scheme is far more than sufficient to guarantee overall file reliability.
Hosts are penalized exponentially for having a price that is higher than expected, but are not preferred exponentially for having a price that is lower than expected. 'Expected' is still undefined, but will likely be determined based on the real world cost of hard drives. Low prices cannot be preferred exponentially because this leaves room for Sybil attackers to get preference by setting the price too low. A host that is 2x the reasonable cost will have 1/32 the score, but a host that is 1/2 the reasonable cost will likely only have 2x the score.
Host scores are increased linearly with the amount of collateral provided on the data, and a minimum amount of collateral is required (to insulate against things like price volatility).
Hosts that demonstrate dishonesty are blacklisted.
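The criteria above can be combined into a rough scoring sketch. Since the exact algorithm is not finalized, everything concrete here is an assumption made for illustration: the `Host` fields, the `EXPECTED_PRICE` and `MIN_COLLATERAL` constants, the fifth-power price penalty (chosen only because it reproduces the 2x price to 1/32 score example), and the size of the uptime penalty.

```python
import random
from dataclasses import dataclass

EXPECTED_PRICE = 100.0  # hypothetical 'expected' price, undefined in the text
MIN_COLLATERAL = 10.0   # hypothetical minimum collateral


@dataclass
class Host:
    name: str
    burned: float      # coins burned
    collateral: float  # collateral offered on the data
    uptime: float      # observed uptime as a fraction in [0, 1]
    price: float       # asking price, must be positive
    blacklisted: bool = False


def price_factor(price: float, expected: float = EXPECTED_PRICE) -> float:
    ratio = price / expected
    if ratio > 1:
        return ratio ** -5  # steep penalty: 2x the price gives 1/32 the score
    return 1 / ratio        # only linear preference: 1/2 the price gives 2x


def uptime_factor(uptime: float) -> float:
    capped = min(uptime, 0.99)  # no extra credit above 99%
    if capped < 0.95:
        return capped * 0.01    # assumed size of the 'heavy' penalty
    return capped


def score(host: Host) -> float:
    if host.blacklisted or host.collateral < MIN_COLLATERAL:
        return 0.0
    # Linear in coins burned and in collateral.
    return (host.burned * host.collateral
            * uptime_factor(host.uptime) * price_factor(host.price))


def pick_host(hosts, rng=random):
    """Select one host at random, weighted by score."""
    return rng.choices(hosts, weights=[score(h) for h in hosts], k=1)[0]


hosts = [
    Host("honest", burned=20.0, collateral=30.0, uptime=0.98, price=100.0),
    Host("pricey", burned=20.0, collateral=30.0, uptime=0.98, price=200.0),
    Host("cheater", burned=50.0, collateral=30.0, uptime=0.99, price=50.0,
         blacklisted=True),
]
for h in hosts:
    print(h.name, score(h))
print("picked:", pick_host(hosts).name)
```

Keeping the burn and collateral terms linear means that splitting one identity into many smaller ones gains an attacker nothing, which is the property the list asks for.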
The selection algorithm is not a part of any protocol, but is instead determined on a per-renter basis. This means that as our understanding of selection strategies improves, we can push out updates to renters that do not break compatibility with the rest of the network. It also means that renters with special needs (such as EU-only due to data regulations) or heightened paranoia are able to use different selection strategies without friction.