Question

Best Practices for Dealing with Large CSV Files

May 24, 2016 5:11PM PDT

I have a large file that is 557 MB zipped and seems to be about 800 terabytes unzipped. I can't tell whether that means it's a zip bomb, but I ran virus checks last night and updated the virus checkers to be sure they're current; I'm going to run them again tonight. It's a public-record data file, so it's possible.

I'm wondering whether Windows can handle such a thing if I do unzip it. Right now I'm on Windows 7; I really don't like Windows 10 generally. I'm also wondering whether there is an editor that can handle a file this size once I get a server set up that can hold it. Do you have suggestions?

I hope I'm using the right forum and am not posting something that's been discussed many times, but this is very important to me, and I hope to get the data I want out of this into SQL Server Express or something similar soon. I'd appreciate any suggestions for dealing with files of this size, as I believe there will be others in the future. I'm also wondering whether I can use ETL or a text editor to break this up into something Windows and my hard drive can handle, since of course I don't have a machine that big right now.
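One quick way to see what the archive actually claims it will expand to, without unzipping anything, is to read the sizes recorded in the zip's central directory. A minimal sketch using Python's standard zipfile module (the archive name here is hypothetical):

```python
# Print the compressed and declared uncompressed size of every member,
# plus the total -- a wildly inflated total is the classic zip-bomb sign.
import zipfile

with zipfile.ZipFile("public_records.zip") as zf:
    total = 0
    for info in zf.infolist():
        total += info.file_size
        print(f"{info.filename}: {info.compress_size:,} -> {info.file_size:,} bytes")
    print(f"Declared uncompressed total: {total / 10**12:.1f} TB")
```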

Comments
Answer
Time to ask questions of those who gave it to you.
May 24, 2016 5:26PM PDT

Text files do compress very well, but I question whether you really have 800 terabytes of space to unzip this to.
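For scale, the implied compression ratio alone is a red flag; ordinary text rarely zips anywhere near this well. The figures below are just the sizes quoted in the question:

```python
# Back-of-the-envelope ratio from the sizes given in the post.
zipped = 557 * 1024**2        # 557 MB
unzipped = 800 * 1024**4      # 800 TB
print(f"{unzipped / zipped:,.0f} : 1")   # roughly 1.5 million to 1
```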

One of the largest disk drives today is only 8 TB ( https://www.google.com/#q=8tb+hdd ), so few would have a server or cloud farm with 100 or more of these drives to hold a file that big.

Can you tell us more about this?

do I really have space
May 24, 2016 7:02PM PDT

I have one service that is looking into renting me a petabyte-sized server at a fairly reasonable price. I choose not to discuss the details online. In fact, I have seen 20 TB external drives for sale, which are also not big enough, which is why I'm looking at renting a server.

I can't really tell you a lot more about the data since I haven't opened it and only have field names for it. I'm not sure how that would help.

I would rather not take the conversation off on a tangent.

The question about 800GB databases was kicked around
May 24, 2016 8:11PM PDT

In prior discussions. In short, you get a database/cloud person on your team. This is not something most folks can deal with today. While I have written SQL apps by the dozen, this one would have me checking the client's ability to pay for the work and the servers.

efficiency
May 24, 2016 8:21PM PDT

I'm the one doing the work and arranging for the server service. I'm just wondering about software options.

It won't be a text editor.
May 24, 2016 8:46PM PDT

So far your questions are those I expect from a Windows PC user. That's not a bad place to start, but the answer is no. I've never found anyone who dares open a multi-terabyte file with a text editor. The delay would be so long that you might have to wait a week for it to come back.

Have you tried this?

So if this were mine, I'd get the data into some SQL server so I could run queries. But that's my weapon of choice.
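Not a prescription, but as an illustration of what "get it into some SQL server" can look like in practice, here is a minimal batch-load sketch in Python using pyodbc. The connection string, table, column names, and file name are all assumptions to be replaced with the real schema:

```python
# Stream a CSV into SQL Server Express in batches so memory use stays flat.
import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost\\SQLEXPRESS;DATABASE=PublicData;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.fast_executemany = True            # bulk-style parameter binding

insert_sql = "INSERT INTO Records (Col1, Col2, Col3) VALUES (?, ?, ?)"
BATCH = 10_000

with open("records.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)                          # skip the header row
    batch = []
    for row in reader:
        batch.append(row[:3])             # keep only the columns we insert
        if len(batch) >= BATCH:
            cursor.executemany(insert_sql, batch)
            conn.commit()
            batch.clear()
    if batch:                             # flush the final partial batch
        cursor.executemany(insert_sql, batch)
        conn.commit()
```

Committing in fixed-size batches keeps memory flat and lets a failed load resume partway through instead of starting over.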

Post was last edited on May 24, 2016 8:50 PM PDT

Re: software options
May 29, 2016 5:10AM PDT

Since you write "the data I want out of this," you'd better contact the owner of the public data and ask them for just that data and no more. How many MB would that be?

Doing the selection in the original database would be much cheaper and faster than your current plans, I think.

Answer
Re: text editor
May 29, 2016 5:24AM PDT

Text editors copy the file into RAM. So to handle an 800 TB file with a text editor, you would need a machine with 800 TB of RAM or 800 TB of virtual memory. I don't think those exist. So a text editor doesn't seem to be the right tool.

ETL tools generally read the data from a file sequentially while processing it. Even rather basic Unix tools like awk and sed work that way. However, it's up to you to program them. Also, not every tool will be able to handle an input file spanned over several disks; the OS has to support that, and Windows doesn't. The maximum file size in Windows is one disk or less, depending on the version of Windows.
You might even need to write your own unzipping program to turn the one zip file into, say, 80 ten-terabyte files.
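In that spirit, here is a minimal sketch of the "write your own splitter" idea: Python's standard zipfile module can stream a member out of the archive and cut it into pieces of a chosen size without ever unzipping the whole thing onto one disk. The archive name, the assumption of a single CSV member, and the chunk size are all placeholders:

```python
# Stream-decompress the zipped CSV and write it back out in fixed-size pieces,
# repeating the header line at the top of each piece.
import zipfile

CHUNK_LINES = 50_000_000                 # lines per output piece; tune to taste

with zipfile.ZipFile("public_records.zip") as zf:
    member = zf.namelist()[0]            # assume one CSV inside the archive
    with zf.open(member) as src:
        header = src.readline()          # keep the header for every piece
        out, part, lines = None, 0, 0
        for line in src:                 # decompresses incrementally, line by line
            if out is None or lines >= CHUNK_LINES:
                if out:
                    out.close()
                part += 1
                out = open(f"records_part{part:03d}.csv", "wb")
                out.write(header)
                lines = 0
            out.write(line)
            lines += 1
        if out:
            out.close()
```

Each piece could then go to a different disk, or be loaded into the database and dropped one at a time.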

Thanks
May 29, 2016 2:20PM PDT

I hadn't thought about how much RAM it would take to handle a file that size. I don't know whether this group will be willing to create a custom CSV file for me, but I guess I could ask. I have a meeting about servers this week, but I think SSIS is going to be the answer.
