Contents
Part 2 Excel data filtering and distribution statistics
Part 3 Pandas Conditional Data Filtering
1. Conditional filtering across different dimensions
2. Compound condition filtering
Part 1 Introduction
In Python, the third-party library Pandas is the main tool for data cleansing, processing, and analysis. Almost any need based on tabular data can be met with Pandas, which is one of the reasons Python dominates the field of data analysis. In the previous technical article we introduced Pandas' data selection features, which lay a good foundation for batch data processing; in this article we continue with a function that is used constantly with tabular data: data filtering.
In Excel / WPS and similar office software, data filtering is used all the time, and Pandas naturally has the same capability, and more. Because Pandas is backed by the Python language, it has a natural advantage in text processing, and this extends to filtering: with regular expressions, Pandas can filter data almost any way you please. What exactly makes it so powerful? Let's find out.
This tutorial is based on pandas version 1.5.3. All Python code in this article was written in the integrated development environment Visual Studio Code (VS Code) using the interactive environment Jupyter Notebook, and the code shared with this article is best opened in VS Code. Tap the link to the original article to see how to get all the demo code and the data used for the demonstrations: Python Teaching | Pandas' fabulous conditional data filtering. The demo data is part of the information on A-share listed companies (18 fields, as of May 12, 2023).
Part 2 Excel data filtering and distribution statistics
In Excel / WPS and similar office software, the most basic data filtering is based on cell contents. For example, select a column, click [Filter] in the toolbar, and then click the dropdown button next to the field name to open a filter-by-value box, as shown in the chart below. Filtering on a field's contents like this is the most basic and most commonly used kind of filtering. Moreover, as soon as you click [Filter], WPS immediately computes the distribution of all values in the current field, showing the count and percentage of each value, so filtering is also a habitual way to obtain value distribution statistics for a field. That brings us to the corresponding statistical operations in Pandas. Below we read the data and compute the distribution of values in the Listing Location field.
Read and display some of the data:
# Import pandas
import pandas as pd
# Read demo data
data = pd.read_excel('./A-share Listed Companies Partial Information (as of 2023.5.12).xlsx')
# Simplify the field names (keep only the part before the first line break)
data.columns = [col.split('\n')[0] for col in data.columns]
# Display the first two lines of data
data.head(2)
Field value count statistics: count the distribution of values in the Listing Location field.
# Count the occurrences of each value
data['Listing Location'].value_counts()
# As shown in the chart below, the largest number of A-share listed companies are listed in "Shenzhen": 2771
Field value proportion statistics: count the distribution of values in the Listing Location field, expressed as proportions.
# For proportions, add the parameter normalize=True to the count statistics
data['Listing Location'].value_counts(normalize=True)
In addition, if you need to search a field for partial matches, you can click [Text Filter] in the upper right of the filter box shown above for more detailed filtering. Besides matching explicit characters, you can also use the wildcards * and ? for fuzzy matching (* represents any number of characters, ? represents any single character), rather like a regular expression.
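This resemblance can be made concrete: Python's standard library can translate an Excel-style wildcard pattern into an equivalent regular expression. A minimal sketch (the pattern and sample names are made up for illustration):

```python
import fnmatch
import re

# Translate an Excel-style wildcard into a regular expression:
# "Bank?*" means "Bank", then exactly one character, then anything
pattern = fnmatch.translate('Bank?*')

names = ['Banking Corp', 'Bank', 'Bankers Trust']
# 'Bank' has nothing after the keyword, so the single-character ? cannot match
matches = [n for n in names if re.match(pattern, n)]
print(matches)  # ['Banking Corp', 'Bankers Trust']
```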
Similar data filtering exists in Pandas, and more: because regular expressions are integrated into Pandas, text data can be filtered far more freely and precisely in fuzzy lookups. Pandas can also filter from more dimensions, as we will see below.
This article makes a simple comparison between Excel / WPS and Pandas in terms of data filtering. The intention is not to put one down in order to praise the other, but to illustrate the differences between the tools. If you also pick up some practical Excel / WPS skills along the way, so much the better.
Part 3 Pandas Conditional Data Filtering
In the previous installment on data selection, we introduced using the loc function for simple conditional filtering. In fact, besides loc, several other functions or forms in Pandas can be used to filter data, such as the query function and slicing with square brackets ([]). Among these, the author believes the most broadly applicable is slicing. The form of slice-based filtering is as follows.
DataFrame variable[<conditional expression>]
How do you use slices to filter data? Below, taking the data read in the previous section (data) as an example, we introduce filtering conditions across different dimensions.
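Before diving in, it helps to see what actually happens inside the brackets. In this minimal sketch (using a tiny made-up DataFrame, not the demo data), the conditional expression evaluates to a boolean Series, and the slice keeps exactly the rows where it is True:

```python
import pandas as pd

# A tiny made-up DataFrame, just to expose the mechanism
df = pd.DataFrame({'city': ['Beijing', 'Shanghai', 'Beijing'],
                   'staff': [5, 12, 8]})

mask = df['city'] == 'Beijing'     # the condition is a boolean Series
print(mask.tolist())               # [True, False, True]

filtered = df[mask]                # keep only rows where the mask is True
print(filtered['staff'].tolist())  # [5, 8]
```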
1. Conditional filtering across different dimensions
(1) Comparing data values
One of the simplest yet most common filtering conditions is checking whether a field's value is equal (or not equal) to a specified value. For example, to filter the data of A-share listed companies listed in Beijing, you can use the following code.
# Filter the data of A-share listed companies listed in Beijing
data[data['Listing Location'] == 'Beijing']
The expression data['Listing Location'] == 'Beijing' is the filtering condition, and part of the filtering result is shown below.
If instead you want the companies in data that are not listed in Beijing, the condition becomes "listing location is not equal to Beijing": simply change the equality operator == in the code above to the inequality operator !=.
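As a quick self-contained sketch (using a small stand-in frame with the same column name, rather than the real demo data), the != filter looks like this:

```python
import pandas as pd

# Stand-in for the demo data, with the same column name as above
data = pd.DataFrame({'Listing Location': ['Beijing', 'Shenzhen', 'Beijing', 'Shanghai']})

# != keeps every row NOT listed in Beijing
not_beijing = data[data['Listing Location'] != 'Beijing']
print(not_beijing['Listing Location'].tolist())  # ['Shenzhen', 'Shanghai']
```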
When there are more candidate values, for example when the city field should be one of Beijing, Shanghai, Guangzhou, or Shenzhen, more than one method can be used; below is the code for one of them.
# Filter out data where the city field is one of Beijing, Shanghai, Guangzhou, or Shenzhen.
data[data['city'].isin(['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen'])]
The two cases above filter on fields whose content is text. To filter a numeric field, we can use Python's other comparison operators. For example, to screen A-share listed companies with 10,000 or more employees, the following code can be used.
# Filter companies with 10,000 or more employees
data[data['total employees'] >= 10000]
Similarly, you can use the other comparison operators (greater than, less than, less than or equal to, and so on) to filter data. Besides comparing a field against a constant, you can also compare a field against another field. For example, we can filter the rows where Total number of employees is greater than or equal to Administrative division code (a condition with no practical meaning, used purely as an example) with the following code.
# Start by converting the Affiliated Administrative Code field type to numeric
data['Administrative division code'] = data['Administrative division code'].astype(int)
# Filter out data where the total number of employees is greater than or equal to the administrative division code to which they belong
data[data['Total number of employees'] >= data['Administrative division code']]
(2) Whether the value is null
Sometimes we need to filter based on whether a field's value is null. For example, to filter the rows of data whose Listing criteria field is non-null, you can use the following code.
# Filter data with non-null values in the belonging to listing criteria field
data[data['Listing criteria'].notna()]
If you need to get data where a field value is null, you can just replace notna() with isna() in the above code.
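A minimal self-contained sketch of the isna() variant, using a stand-in frame with one missing value:

```python
import pandas as pd
import numpy as np

# Stand-in frame: one missing value in the 'Listing criteria' column
data = pd.DataFrame({'Listing criteria': ['Standard 1', np.nan, 'Standard 2']})

# isna() keeps only the rows where the field is null
null_rows = data[data['Listing criteria'].isna()]
print(len(null_rows))  # 1
```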
The default null value in Python is None, while in Pandas the default null (or missing) value in tabular data is NaN, a constant from the numpy library meaning Not a Number. Since Pandas was developed on top of numpy, this convention carries over, and nulls in table data are represented as NaN (numpy.nan). The isna() and notna() functions recognize both None and NaN as null, but not the empty string '': an empty string is not a null value in Pandas.
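The distinction is easy to verify with a throwaway Series containing None, NaN and an empty string:

```python
import pandas as pd
import numpy as np

# None and np.nan are nulls; the empty string is ordinary text
s = pd.Series([None, np.nan, '', 'text'])

print(s.isna().tolist())   # [True, True, False, False]
print(s.notna().tolist())  # [False, False, True, True]
```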
(3) Text Screening
As mentioned in the Excel / WPS introduction, that kind of office software can filter a field by text: rows whose value contains (or does not contain) a keyword, or begins / ends (or does not begin / end) with one, plus fuzzy filtering with wildcards. Those operations are already quite powerful in Excel, but for Pandas, backed by regular expressions, they are all a piece of cake. Here is how Pandas filters data by the text content of a field. The simplest condition is whether a field contains (or does not contain) a keyword. For example, to filter the rows of data whose Company Name in Chinese field contains the keyword "Securities", you can use the following filtering code.
# Filter the data for companies with the keyword "securities" in the Chinese name field.
data[data['Company Name in Chinese'].str.contains('Securities')]
# 41 eligible data after screening
If you wish instead to keep all the rows whose Company Name in Chinese field does not contain the keyword "Securities", simply prepend a tilde (~) to the conditional expression in the code above; the code follows (the filtering result is no longer shown).
# Filter the data for enterprises whose Chinese company name field does not contain the keyword "securities".
data[~data['Company Name in Chinese'].str.contains('Securities')]
Beyond basic keyword filtering, the other Excel-style text filters, such as matching the beginning or end of the text, can be realized in Pandas through regular expressions. How do you use regular expressions in Pandas data filtering? It's very simple: in the code above, just replace the keyword passed to the contains function with a regular expression. For example, to filter the rows whose Company Name in Chinese field contains "Securities", "Banking" or "Insurance", you can use a regular expression, as in the following code.
Regular expression tutorials have been updated; please find the relevant links at the end of the article.
# Filter data with "Securities", "Banking" or "Insurance" in the Company Name in Chinese field
data[data['Company Name in Chinese'].str.contains('Securities|Banking|Insurance')]
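The Excel-style "begins with" / "ends with" filters mentioned above correspond to the regex anchors ^ and $; Pandas also offers the dedicated str.startswith and str.endswith helpers. A small self-contained sketch with made-up names:

```python
import pandas as pd

names = pd.Series(['China Securities', 'Securities Star', 'Pacific Insurance'])

# "Begins with" via a regex anchor, or the dedicated helper
print(names[names.str.contains('^Securities')].tolist())   # ['Securities Star']
print(names[names.str.startswith('Securities')].tolist())  # ['Securities Star']

# "Ends with" via $, or str.endswith
print(names[names.str.contains('Insurance$')].tolist())    # ['Pacific Insurance']
```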
Beyond this, more complex regular expressions support more complex filtering operations. For example, to filter the rows whose registered address field contains no Chinese characters, you can use the code below.
# Filter data that does not contain Chinese characters in the Registration Address field
data[~data['registered address'].str.contains('[\u4e00-\u9fa5]')]
One point deserves special attention when filtering by text: the field being filtered must consist entirely of character-type values. It cannot contain null values (if it does, fill the missing values with a string first), nor numeric or date types, because regular expressions cannot process those types.
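One way to satisfy that requirement is to fill the nulls and cast the column to string before filtering. A sketch with a deliberately messy stand-in column (the values are made up):

```python
import pandas as pd
import numpy as np

# A messy column mixing text, a number and a null value
addr = pd.Series(['Beijing Rd. 1', 42, np.nan])

# Fill nulls with an empty string, then cast everything to str
clean = addr.fillna('').astype(str)
print(clean[clean.str.contains('Beijing')].tolist())  # ['Beijing Rd. 1']
```

Alternatively, str.contains accepts an na= parameter (e.g. na=False) that controls how null comparison results are treated when indexing.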
(4) Data value length
Sometimes the length of the text in a field itself carries information: for example, the more characters in a registered address, the more detailed the address description. Pandas can filter by character length. For example, to screen all rows whose registered address field has 10 or fewer characters, the following code can be used.
# Filter all data in the Registered Address field with a character count less than or equal to 10
data[data['registered address'].str.len() <= 10]
(5) Date Filtering
In addition to filtering on numeric values and characters, Pandas can also filter data by date, as Excel does. For example, to screen the A-share companies newly listed in 2023, you can use the following code.
# First convert the Listing Date field to a datetime type
data['Listing Date'] = data['Listing Date'].astype('datetime64[ns]')
# Screening for new A-share companies to be listed in 2023
data[data['Listing Date'] >= pd.Timestamp(2023, 1, 1)]
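An equivalent way to express the same year filter is the .dt accessor; a self-contained sketch with stand-in dates:

```python
import pandas as pd

# Stand-in listing dates
data = pd.DataFrame({'Listing Date': pd.to_datetime(['2022-11-30', '2023-02-01', '2023-04-15'])})

# Same 2023 filter, expressed through the .dt accessor
in_2023 = data[data['Listing Date'].dt.year == 2023]
print(len(in_2023))  # 2
```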
(6) Other
Besides the dimensions above, Pandas can filter data on still other dimensions, such as by index value or field name, but these are of limited practical use and will not be covered further here.
2. Compound condition filtering
You may have noticed that each filtering case described above uses only one condition, even though the dimensions differ. In fact, Pandas filtering can combine two or more conditions. When there is more than one condition, it is best to enclose each condition in parentheses; if two conditions are in an "or" relationship, connect them with the logical-or symbol |, and if they are in an "and" relationship, connect them with the logical-and symbol &. All the conditional forms described above can be combined in any permutation. Here are a few examples of compound condition filtering.
[Example 1] Filter out the data of A-share listed companies whose administrative district is located in “Jiangsu, Zhejiang and Shanghai” and whose total number of employees is greater than or equal to 20,000.
## [Example 1] Filter the data of A-share listed companies whose administrative district is located in "Jiangsu, Zhejiang and Shanghai" and whose total number of employees is greater than or equal to 20,000.
# A few unnecessary spaces have been added to the code for ease of viewing
data[ (data['Administrative divisions'].isin(['Shanghai', 'Jiangsu', 'Zhejiang'])) & (data['Total employees'] >= 20000)]
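The query function mentioned at the start of Part 3 can express the same kind of compound condition in a single string. A sketch with a small stand-in frame (column names without spaces for simplicity; real column names containing spaces must be wrapped in backticks inside query):

```python
import pandas as pd

# Stand-in data for Example 1's two columns
data = pd.DataFrame({'Region': ['Shanghai', 'Jiangsu', 'Beijing'],
                     'Staff': [25000, 15000, 30000]})

# Slice-style compound filter ...
a = data[(data['Region'].isin(['Shanghai', 'Jiangsu', 'Zhejiang'])) & (data['Staff'] >= 20000)]

# ... and the equivalent query() call
b = data.query("Region in ['Shanghai', 'Jiangsu', 'Zhejiang'] and Staff >= 20000")

print(a.equals(b))  # True
```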
[Example 2] Filter the data of enterprises whose Enterprise Size is "Large", or whose Registered Capital is greater than or equal to 1 billion.
## [Example 2] Filter the data of enterprises whose enterprise size is large, or whose registered capital is greater than or equal to 1 billion
data[(data['Enterprise Size'] == 'Large') | (data['Registered Capital'] >= 1000000000)]
That's all for conditional data filtering in Pandas.