Remove a range of lines from a huge file

Remove_a_range_of_lines_from_a_huge_fileWhen you have to deal with a huge file containing gigabytes of data and need to delete a range of lines from it, opening it with a text editor can be a fatal mistake. As an example, in this post I will demonstrate how to delete a whole database out of a huge SQL dump without messing around with a text editor.

In the following I use a MySQL dump with around 48729 lines and a total file size of about 40GB. Even trying to open this in a text editor will be a nightmare. I used the following command to get the number of lines in my huge dump file:

$ sed -n '$=' mysql_database.dump
48729

The most important task to remove a range of lines from such a huge file is to identify the lines you want to remove. For a MySQL dump you can use the comment lines in the dump to find, for example, the database you want to remove from it.

$ mysqldump -h 127.0.0.1 -u username -p --all-databases >mysql_database.dump

Lets assume you have exported the complete mysql database with the above command. You will find comments like the following in the dump file. The easiest wat to check this is to use the “head” command which will show you just a number of lines from the beginning of the file.

--
-- Current Database: `website`
--

This line can be used as some kind of identifier for finding where each database starts in the file. With the following command we will identify the exact line number at which this “Current Database: ” line appears.

$ grep -n "Current Database: " mysql_database.dump 

The parameter “-n” will instruct grep to show the line number of the match. The result might look similar to this.

19:-- Current Database: `website`
1715:-- Current Database: `mysql`
1793:-- Current Database: `project_1`
23206:-- Current Database: `project_2`
24457:-- Current Database: `test`
24679:-- Current Database: `project_3`
44237:-- Current Database: `project_4`
44328:-- Current Database: `project_5`
44390:-- Current Database: `project_6`
47030:-- Current Database: `syslog`

With the line numbers at hand the last and final step is to remove the lines from the 40GB file. To do this, we use the “sed” command. With the following command we will instruct sed to delete the lines from the output. It will read the file content, delete the specified range of lines and the resulting output will be redirected to a file (instead of being printed to the screen).

$ sed '1715,1792d' mysql_database.dump >mysql_database_filtered.dump

With the command above, the database “mysql” – starting at line 1715 and ending one line before the “project_1” database starts – will be deleted. The “d” next to the range of lines instructs “sed” to delete the given range of lines.

If you need more then one of range deleting, separate them with a semicolon like this “1714,1793d;24457,24679d”. If you have the need of the opposite (to include only the lines specified) the command would look like this.

$ sed -n '1714,1793p; 1793q' mysql_database.dump >mysql_database_filtered.dump

The command above will instruct “sed” with the “-n” option to print no lines from the input file while the “1714,1793p” will instruct “sed” to print this range of lines. You can specify additional ranges to include, in the same way as with deletion of lines.

The last part of the “sed” instruction is more a performance tweak. It instructs “sed” to stop processing (“q” for quit) at that specific line. As it does not make any sense to let “sed” continue processing the huge file after the last line you want to print out is processed.


Read more of my posts on my blog at http://blog.tinned-software.net/.

This entry was posted in Linux Administration and tagged , . Bookmark the permalink.