
Serving large files over TCP

Started January 13, 2015 07:06 PM
27 comments, last by hplus0603 9 years, 10 months ago

My server needs to send some large files to my clients over TCP connections. These are personal files, so they cannot be hot-linked through a CDN URL. These files are also stored on S3.

What I am thinking right now is to have my server download these files from S3, then send them over TCP. However, I am concerned about memory usage, because the server needs to download each file first, and it will be kept in memory before being sent to the client. If these files average about 5 MB, and I have 1000 connected clients with one file per client, that means the server must have at least 8-10 GB of RAM. The 5 MB/file average is actually a pretty conservative estimate; some of these files can be as big as 20 MB.

Is there a better alternative?

Why do you need to download all the files before starting the first upload? As far as I understand the OP, the files are stored separately, and can therefore be handled one by one, or at least in groups.


Why do you need to keep the file in memory instead of streaming the download/upload to/from disk? And even if you do need to keep files in memory, do you really need to keep all of them in memory at the same time?

Is there any reason your server downloads the file from S3 just to send it to the client? That means at least 1 GB of wasted bandwidth to send a 512 MB file. Can't you just give the client the HTTP URL of the file on S3 and let them download it directly?

You said these were personal files, but they can be hidden behind a wrapper script in front of S3 very easily, with a few lines of PHP...

These are personal files, so they cannot be hot-linked through a CDN URL. These files are also stored on S3.

These two statements work against each other.

If you've got authenticated S3 buckets, you can serve up your data with a very simple HTTP wrapper: the stream in becomes the stream out. Yes, you are paying twice for bandwidth, but that is the trade-off you accept by going through your second server.
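
For example, a rough sketch of such a wrapper in Python (Flask plus boto3, chosen here only for brevity; the bucket name, route, and missing auth check are placeholders you would replace):

import boto3
from flask import Flask, Response

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-private-bucket"  # placeholder bucket name

@app.route("/files/<path:key>")
def serve_file(key):
    # Your own authentication/ownership check belongs here before anything is served.
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    # Forward the S3 body in fixed-size chunks, so only one chunk per request
    # is ever held in memory -- the stream in becomes the stream out.
    return Response(obj["Body"].iter_chunks(chunk_size=64 * 1024),
                    content_type=obj.get("ContentType", "application/octet-stream"))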

If you can give your clients ACL entries, or if the buckets are unauthenticated, just give them the S3 endpoint and be done.

Or, if your clients have already logged in to your site and you want to keep access to S3 limited, use a temporary security credential on that file or that client's bucket.
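
One simple way to do that is a presigned URL; a minimal sketch with boto3 (the bucket, key, and expiry are made-up values):

import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-private-bucket", "Key": "clients/123/file.bin"},
    ExpiresIn=900,  # the link stops working after 15 minutes
)
# Hand `url` to the already-logged-in client over your existing channel.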

Or, if S3 is not the best fit, you can use one of their other storage systems and provide access with other mechanisms such as running secure FTP on an EC2 server.

I didn't mean to store them all. That's just the worst-case scenario, in case the clients are all connected and requesting these files at the same time. After you download them from S3, wouldn't these files, at some point, reside in memory, even for a short while?

I will investigate whether S3 allows some temporary credentials to be issued. All I have read is that S3 has some authentication, but that seems to be for my servers, not my users.


I didn't mean to store them all. That's just the worst-case scenario, in case the clients are all connected and requesting these files at the same time. After you download them from S3, wouldn't these files, at some point, reside in memory, even for a short while?

Only if you keep the entire file there. But, as I said, you could probably stream them to/from disk so you don't have to keep the entire file in memory, or, as others said, forward them directly to the user while you download them yourself.
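
To make that concrete, here is one way to keep memory bounded (a sketch in Python with boto3; the chunk size, temp-file handling, and raw-socket send are illustrative assumptions, not a finished design):

import os
import tempfile

import boto3

s3 = boto3.client("s3")

def send_to_client(sock, bucket, key, chunk_size=64 * 1024):
    # Spool the S3 object to disk instead of RAM...
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        s3.download_fileobj(bucket, key, tmp)
        path = tmp.name
    try:
        # ...then forward it over the TCP socket one small chunk at a time,
        # so only chunk_size bytes per connection ever sit in memory.
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                sock.sendall(chunk)
    finally:
        os.remove(path)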


I will investigate whether S3 allows some temporary credentials to be issued. All I have read is that S3 has some authentication, but that seems to be for my servers, not my users.

Yes, you can issue temporary credentials. You can also use their various AssumeRole... functions, federation tokens, and session tokens, depending on what you are doing. Unfortunately the granularity is pretty rough. If you are storing each client's information in their own S3 bucket this can work well. If you have large S3 buckets that share many different customers' data, it may not work very well.
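
As a rough illustration of the federation-token route (Python/boto3; the bucket, prefix, and duration are guesses you would adapt, and scoping by key prefix only helps if your bucket layout allows it):

import json

import boto3

sts = boto3.client("sts")
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        # Limit the token to this client's own prefix (assumed layout).
        "Resource": "arn:aws:s3:::my-private-bucket/clients/123/*",
    }],
}
creds = sts.get_federation_token(
    Name="client-123",
    Policy=json.dumps(policy),
    DurationSeconds=3600,  # expires after an hour
)["Credentials"]
# creds contains AccessKeyId, SecretAccessKey, and SessionToken the client
# can use to fetch its own objects directly from S3.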

Why do you have to read the files from S3 if you have your own servers?
You may keep them there for backup, but I presume that a copy of the files will live on your servers' disks at all times.
That way, your servers just stream files from disk to clients -- like any web server.

If you gave us more information about what you're trying to accomplish at a higher level, perhaps we could give better advice.
As it stands, your assumptions/architecture seems borderline nonsensical.
enum Bool { True, False, FileNotFound };

Do you even need authentication? You could store the files for every client in a directory that has a UUID as its name (inside a directory with a UUID as its name, if you want), or simply a base-64 encoded 256-bit hash of the client's salted contact details or something (that would make it easier for you to find them).

I mean, authentication and proxying through another server is all nice, but what attacker is going to "guess" a URL containing a 256-bit number correctly? That's pretty unlikely to happen.

As long as your client receives the URL via a secure channel, all is good. They can just as well download from S3.
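
For what it's worth, deriving such an unguessable path is only a few lines in most languages; a Python sketch (the salt handling and the fields being hashed are placeholders):

import base64
import hashlib

def object_key_for(client_email: str, salt: bytes) -> str:
    # 256-bit salted hash, URL-safe base64: about 43 characters, infeasible to guess.
    digest = hashlib.sha256(salt + client_email.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

# e.g. https://my-bucket.s3.amazonaws.com/<object_key_for(email, salt)>/file.bin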

This topic is closed to new replies.
