How to Read 5 Line of Text File Using Pandas

What is a fixed width text file?

A fixed width file is similar to a csv file, but rather than using a delimiter, each field has a set number of characters. This creates files with all the data tidily lined up, with an appearance similar to a spreadsheet when opened in a text editor. This is convenient if you're looking at raw data files in a text editor, but less ideal when you need to programmatically work with the data.

Fixed width files have a few common quirks to keep in mind:

When values don't consume the total character count for a field, a padding character is used to bring the character count up to the total for that field.
Any character can be used as a padding character as long as it is consistent throughout the file. White space is a common padding character.
Values can be left or right aligned in a field and alignment must be consistent for all fields in the file.

A thorough description of a fixed width file is available here.

Note: The fields in a fixed width file do not all need to have the same character count. For example: in a file with three fields, the first field could be 6 characters, the second 20, and the last 9.
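To make the padding and alignment rules concrete, here is a minimal sketch in plain Python. The field widths (6, 20 and 9) come from the note above; the helper names and the record values are made up for illustration only.

```python
# Field widths taken from the note above: 6, 20 and 9 characters.
WIDTHS = [6, 20, 9]


def to_fixed_width(values, widths=WIDTHS, pad=" "):
    """Left-align each value and pad it up to its field width."""
    return "".join(str(v).ljust(w, pad) for v, w in zip(values, widths))


def from_fixed_width(line, widths=WIDTHS, pad=" "):
    """Slice a line back into its fields and strip the padding."""
    fields, start = [], 0
    for w in widths:
        fields.append(line[start:start + w].strip(pad))
        start += w
    return fields


# Hypothetical record, used only to show the round trip.
record = ["A1", "example value", "42"]
line = to_fixed_width(record)
print(repr(line))
print(from_fixed_width(line))
```

Note that this sketch does not truncate values that are longer than their field; a real writer would need to decide how to handle those.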

How to spot a fixed width text file?

Upon initial examination, a fixed width file can look like a tab separated file when white space is used as the padding character. If you're trying to read a fixed width file as a csv or tsv and getting mangled results, try opening it in a text editor. If the data all line up tidily, it's probably a fixed width file. Many text editors also give character counts for cursor placement, which makes it easier to spot a pattern in the character counts.
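The same visual check can be roughly approximated in code. The sketch below is a hypothetical helper, not part of pandas: it reports the character positions that hold a space in every sampled line. Runs of such positions suggest gaps between fixed width fields, while a tab separated file typically has none. The sample rows are invented.

```python
def blank_columns(lines):
    """Return the character positions that hold a space in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    return [i for i in range(width) if all(row[i] == " " for row in padded)]


# Invented rows, padded to field widths of 8 and 10 characters.
rows = [
    ("GENE1", "1p36.33", "P00001"),
    ("GENE22", "1p36.3", "Q00002"),
    ("G3", "1q21", "P00003"),
]
sample = [f"{a:<8}{b:<10}{c}" for a, b, c in rows]

print(blank_columns(sample))
```

This is only a heuristic: free text fields that contain spaces, or a small sample, can hide or fake a gap.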

If your file is too large to easily open in a text editor, there are various ways to sample portions of it into a separate, smaller file on the command line. An easy method on a Unix/Linux system is the head command. The example below uses head with -n 50 to read the first 50 lines of large_file.txt and then copy them into a new file called first_50_rows.txt.

          head -n 50 large_file.txt > first_50_rows.txt
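If the head command isn't available (on Windows, for example), a few lines of Python can do a similar job. This is a minimal sketch; copy_head is a hypothetical helper name, and the file names in the commented call are the ones used in the command above.

```python
from itertools import islice


def copy_head(src_path, dst_path, n=50):
    """Copy the first n lines of src_path into dst_path, like `head -n`."""
    with open(src_path, "r", encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n lines, so the rest of the file is never read.
        dst.writelines(islice(src, n))


# Example call, using the file names from the command above:
# copy_head("large_file.txt", "first_50_rows.txt", n=50)
```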

Let's work with a real life example file

UniProtKB Database

The UniProt Knowledgebase (UniProtKB) is a freely accessible and comprehensive database for protein sequence and annotation data available under a CC-BY (4.0) license. The Swiss-Prot branch of the UniProtKB has manually annotated and reviewed information about proteins for various organisms. Complete datasets from UniProt can be downloaded from ftp.uniprot.org. The data for human proteins are contained in a set of fixed width text files: humchr01.txt - humchr22.txt, humchrx.txt, and humchry.txt.

We don't need all 24 files for this example, so here's the link to the first file in the set:

https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/humchr01.txt
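For readers who prefer to fetch the file from Python rather than a browser, here is a minimal sketch using only the standard library. The URL is the one given above; download_example and the default destination name are assumptions for illustration, and the call is left commented out because it needs network access.

```python
from urllib.request import urlretrieve

# URL of the first file in the set, as given above.
URL = (
    "https://ftp.uniprot.org/pub/databases/uniprot/current_release/"
    "knowledgebase/complete/docs/humchr01.txt"
)


def download_example(url=URL, dest="humchr01.txt"):
    """Download the example file to dest and return the local path."""
    path, _headers = urlretrieve(url, dest)
    return path


# download_example()  # uncomment to fetch the file (requires network access)
```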

Examine the file before reading it with pandas

A quick glance at the file in a text editor shows a substantial header that we don't need, leading into 6 fields of data.

Snippet of the beginning of humchr01.txt

Fixed width files don't seem to be as common as many other data file formats and they can look similar to tab separated files at first glance. Visual inspection of a text file in a good text editor before trying to read a file with Pandas can substantially reduce frustration and help highlight formatting patterns.
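When a text editor isn't handy, printing a numbered window of lines is a rough substitute for finding where the header ends and the table begins. The helper below is a hypothetical sketch, not a pandas function; the line range in the commented call is only an example.

```python
from itertools import islice


def numbered_preview(path, start=1, stop=40):
    """Return (line_number, text) pairs for lines start..stop, 1-indexed."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        window = islice(handle, start - 1, stop)
        return [(start + i, line.rstrip("\n")) for i, line in enumerate(window)]


# Example call for the file discussed above:
# for number, text in numbered_preview("humchr01.txt", 30, 40):
#     print(f"{number:>4} | {text}")
```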

Using pandas.read_fwf() with default parameters

Note: All code for this example was written for Python 3.6 and Pandas 1.2.0.

The documentation for pandas.read_fwf() lists 5 parameters:

filepath_or_buffer, colspecs, widths, infer_nrows, and **kwds

Two of the pandas.read_fwf() parameters, colspecs and infer_nrows, have default values that work to infer the columns based on a sampling of initial rows.

Let's use the default settings for pandas.read_fwf() to get our tidy DataFrame. We'll leave the colspecs parameter at its default value of 'infer', which in turn uses the default value (100) of the infer_nrows parameter. These two defaults attempt to find a pattern in the first 100 rows of data (after any skipped rows) and use that pattern to split the data into columns.
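Before turning to the real file, the default inference can be seen on a tiny in-memory sample. This is a sketch under stated assumptions: the rows and column names are invented, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Invented sample: a header row plus three records, padded with spaces
# to field widths of 8 and 10 characters (the last field is unpadded).
rows = [
    ("gene", "position", "accession"),
    ("GENE1", "1p36.33", "P00001"),
    ("GENE22", "1p36.3", "Q00002"),
    ("G3", "1q21", "P00003"),
]
sample = "".join(f"{a:<8}{b:<10}{c}\n" for a, b, c in rows)

# colspecs='infer' and infer_nrows=100 are the defaults, so neither is passed.
df = pd.read_fwf(io.StringIO(sample))
print(df)
```

Because every sampled row shares the same blank character positions, pandas can split the lines into three columns without being told the widths.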

Basic file cleanup

There are several rows of file header that precede the tabular info in our example file. We need to skip them when we read the file.

None of the parameters seem ideal for skipping rows when reading the file. So how do we do it? We use the **kwds parameter.

Conveniently, pandas.read_fwf() uses the same TextFileReader context manager as pandas.read_table(). This, combined with the **kwds parameter, allows us to use parameters for pandas.read_table() with pandas.read_fwf(). So we can use the skiprows parameter to skip the first 35 rows in the example file. Similarly, we can use the skipfooter parameter to skip the last 5 rows of the example file, which contain a footer that isn't part of the tabular data.

          pandas.read_fwf('humchr01.txt', skiprows=35, skipfooter=5)

The above attempt leaves the DataFrame a bit of a mess 😔:

Note: Since we're using the default values for colspecs and infer_nrows we don't have to declare them.

Part of the issue here is that the default colspecs parameter is trying to infer the column widths based on the first 100 rows, but the row right before the tabular data (row 36 in the file and shown in the column names above) doesn't really follow the character count patterns in the data table, so the inferred column widths are getting mangled.

If we'd set skiprows to 36 instead of 35, we'd have ended up with the first row of data pushed into the column names, which also mangles the inferred column widths. There's no winning here without some additional cleanup. Let's settle the column names issue with the names parameter and see if that helps.

Note: Using the names parameter means we are not allocating a row in the file to column names, so we as users have to make sure to account for the fact that skiprows must start at the first data row. So skiprows is set to 36 in the next example, but it was 35 in previous examples when we didn't use the names parameter.

          pandas.read_fwf('humchr01.txt', skiprows=36, skipfooter=5, names=['gene_name', 'chromosomal_position', 'uniprot', 'entry_name', 'mtm_code', 'description'])

That's better, but still a bit of a mess. Pandas inferred the column splits correctly, but pushed the first 2 fields to the index. Let's fix the index issue by setting index_col=False.

          pandas.read_fwf('humchr01.txt', skiprows=36, skipfooter=5, index_col=False, names=['gene_name', 'chromosomal_position', 'uniprot', 'entry_name', 'mtm_code', 'description'])

That looks good! The columns are split correctly, the column names make sense and the first row of data in the DataFrame matches the first row in the example file.
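The same combination of arguments can be tried without downloading anything. The sketch below uses an invented in-memory sample with two header lines and one footer line, so the skiprows and skipfooter values differ from the ones used for humchr01.txt. One assumption to flag: because the sample is tiny, infer_nrows is lowered to the number of data rows so that the footer line is not sampled when pandas infers the columns. The real file has far more than 100 data rows, which is why the default was fine there.

```python
import io

import pandas as pd

# Invented data rows, padded to field widths of 8 and 10 characters.
rows = [
    ("GENE1", "1p36.33", "P00001"),
    ("GENE22", "1p36.3", "Q00002"),
    ("G3", "1q21", "P00003"),
]
body = "".join(f"{a:<8}{b:<10}{c}\n" for a, b, c in rows)
sample = (
    "Example header line one\n"
    "Example header line two\n"
    + body
    + "Example footer line\n"
)

df = pd.read_fwf(
    io.StringIO(sample),
    skiprows=2,             # skip both header lines
    skipfooter=1,           # drop the footer line
    index_col=False,        # keep every field as a regular column
    infer_nrows=len(rows),  # sample only the data rows when inferring columns
    names=["gene_name", "chromosomal_position", "uniprot"],
)
print(df)
```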

We relied on the default settings for two of the pandas.read_fwf() specific parameters to get our tidy DataFrame. The colspecs parameter was left at its default value of 'infer', which in turn uses the default value of the infer_nrows parameter, finds a pattern in the first 100 rows of data (after the skipped rows) and uses that to split the data into columns. The default parameters worked well for this example file, but we could also specify the colspecs parameter instead of letting pandas infer the columns.

Setting field widths manually with colspecs

Just like with the example above, we need to start with some basic cleanup. We'll drop the header and footer in the file and set the column names just like before.

The next step is to build a list of tuples with the intervals of each field. The list below fits the example file.

          colspecs = [(0, 14), (14, 30), (30, 41), (41, 53), (53, 60), (60, -1)]

Note the last tuple: (60, -1). We can use -1 to indicate the last index value. Alternatively, we could use None instead of -1 to indicate the last index value. Because each tuple is treated as a half-open interval, None is the safer choice if the very last character of a line matters.

Note: When using colspecs the tuples don't have to be exclusive! The final columns can be set to tuples that overlap if that is desired. For example, if you want the first field duplicated: colspecs = [(0, 14), (0, 14), ...

          pandas.read_fwf('humchr01.txt', skiprows=36, skipfooter=5, colspecs=colspecs, names=['gene_name', 'chromosomal_position', 'uniprot', 'entry_name', 'mtm_code', 'description'])

Once more we've attained a tidy DataFrame. This time we explicitly declared our field start and stop positions using the colspecs parameter rather than letting pandas infer the fields.
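Explicit intervals can also be tried on a small in-memory sample. In this sketch the rows, field widths and column names are invented and do not match humchr01.txt. It also shows the widths parameter, which was listed earlier but not demonstrated: for contiguous fields, a list of widths describes the same layout as a list of (start, stop) tuples.

```python
import io

import pandas as pd

# Invented data rows, padded to field widths of 8 and 10 characters.
rows = [
    ("GENE1", "1p36.33", "P00001"),
    ("GENE22", "1p36.3", "Q00002"),
    ("G3", "1q21", "P00003"),
]
sample = "".join(f"{a:<8}{b:<10}{c}\n" for a, b, c in rows)
names = ["gene_name", "chromosomal_position", "uniprot"]

# Half-open (start, stop) intervals; None leaves the last field open-ended.
colspecs = [(0, 8), (8, 18), (18, None)]
df_colspecs = pd.read_fwf(io.StringIO(sample), colspecs=colspecs, names=names)

# The same layout expressed as contiguous field widths.
df_widths = pd.read_fwf(io.StringIO(sample), widths=[8, 10, 6], names=names)

print(df_colspecs)
print(df_colspecs.equals(df_widths))
```

Only one of colspecs and widths can be passed in a single call, so the two reads are kept separate.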

Conclusion

Reading fixed width text files with Pandas is easy and accessible. The default parameters for pandas.read_fwf() work in most cases and the customization options are well documented. The Pandas library has many functions to read a variety of file types, and pandas.read_fwf() is one more useful Pandas tool to keep in mind.

Source: https://towardsdatascience.com/parsing-fixed-width-text-files-with-pandas-f1db8f737276
