SQL中两个字符串的公共子串

时间：2023-02-21

本文介绍了SQL中两个字符串的公共子串的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我需要在 SQL 中找到两个字符串的公共子字符串(没有空格).

查询:

选择 *从 tbl 为 a，tbl 为 b其中 a.str <>b.str

示例数据:

str1 |str2 |不带空格的最大子串----------+-----------+---------------------------——aabcdfbas |里克迪瓦 |发展基金aaab akuc |aaabir |aaabab akuc |ab atr |AB

解决方案

我不同意那些说 SQL 不是这项工作的工具的人.在有人向我展示比我的任何编程语言的解决方案更快的方法之前，我会断言 SQL(以基于集合的方式编写，无副作用，仅使用不可变变量)是唯一em> 用于此工作的工具(处理 varchar(8000)- 或 nvarchar(4000) 时).下面的解决方案是针对 varchar(8000).

1.正确索引的计数(数字)表.

-- (1) 构建并填充一个持久化的(数字)计数如果 OBJECT_ID('dbo.tally') 不是 NULL DROP TABLE dbo.tally;CREATE TABLE dbo.tally (n int not null);WITH DummyRows(V) AS(SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))) t(N))插入数据库SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c CROSS JOIN DummyRows d;-- (2) 为性能添加必需的约束(和索引)ALTER TABLE dbo.tally添加约束 pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR = 100;ALTER TABLE dbo.tally添加约束 uq_tally UNIQUE NONCLUSTERED(N);

请注意，计数表功能的性能将不佳.

2.使用我们的计数表将所有可能的子串返回一个字符串

以abcd"为例，让我们获取它的所有子字符串.注意我的评论.

声明@s1 varchar(8000) = 'abcd';选择位置 = t.N,tokenSize = x.N,string = substring(@s1, t.N, x.N)FROM dbo.tally t -- 令牌位置CROSS JOIN dbo.tally x -- 令牌长度WHERE t.N <= len(@s1) -- 所有位置AND x.N <= len(@s1) -- 所有长度AND len(@s1) - t.N - (x.N-1) >= 0 -- 过滤不必要的行 [e.g.substring('abcd',3,2)]

position tokenSize 字符串----------- ----------- -------1 1 一个2 1 乙3 1 c4 1 天1 2 抗体2 2 公元前3 2 cd1 3 美国广播公司2 3 bcd1 4 abcd

3.dbo.getshortstring8K

这个功能是什么?第一个重大优化.我们将把两个字符串中较短的字符串分成每个可能的子字符串，然后查看它是否存在于较长的字符串中.如果您有两个字符串(S1 和 S2)并且 S1 比 S2 长，我们知道 S1 的任何一个比 S2 长的子字符串都不是 S2 的子字符串.这就是 dbo.getshortstring 的目的:确保我们不执行任何不必要的子字符串比较.这会更有意义.

这非常重要，因为可以使用

...没有排序或不必要的操作.速度而已.对于更长的字符串(例如 50 个字符以上)，我有一个更快的技术，您可以阅读关于这里.

I need to find common substring (without space) of two strings in SQL.

Query:
select * from tbl as a, tbl as b where a.str <> b.str
Sample data:
str1 | str2 | max substring without spaces ----------+-----------+----------------------------- aabcdfbas | rikcdfva | cdf aaab akuc | aaabir a | aaab ab akuc | ab atr | ab

解决方案
I disagree with those who say that SQL is not the tool for this job. Until someone can show me a faster way than my solution in ANY programming language, I will aver that SQL (written in a set-based, side-effect free, using only immutable variables) is the ONLY tool for this job (when dealing with varchar(8000)- or nvarchar(4000)). The solution below is for varchar(8000).

1. A correctly indexed tally (numbers) table.
-- (1) build and populate a persisted (numbers) tally IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally; CREATE TABLE dbo.tally (n int not null); WITH DummyRows(V) AS(SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)) INSERT dbo.tally SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c CROSS JOIN DummyRows d; -- (2) Add Required constraints (and indexes) for performance ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR = 100; ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);
Note that a tally table function will not perform as well.

2. Using our tally table to return all possible substrings a string

Using "abcd" as an example, let's get all of it's substrings. Note my comments.
DECLARE @s1 varchar(8000) = 'abcd'; SELECT position = t.N, tokenSize = x.N, string = substring(@s1, t.N, x.N) FROM dbo.tally t -- token position CROSS JOIN dbo.tally x -- token length WHERE t.N <= len(@s1) -- all positions AND x.N <= len(@s1) -- all lengths AND len(@s1) - t.N - (x.N-1) >= 0 -- filter unessesary rows [e.g.substring('abcd',3,2)]
This returns
position tokenSize string ----------- ----------- ------- 1 1 a 2 1 b 3 1 c 4 1 d 1 2 ab 2 2 bc 3 2 cd 1 3 abc 2 3 bcd 1 4 abcd
3. dbo.getshortstring8K

What's this function about? The first major optimization. We're going to break the shorter of the two strings into every possible substring then see if it exists in the longer string. If you have two strings (S1 and S2) and S1 is longer than S2, we know that none of the substrings of S1, that are longer than S2, will be a substring of S2. That's the purpose of dbo.getshortstring: to ensure that we don't perform any unnecessary substring comparisons. This will make more sense in a moment.

This is hugely important because, the number of substrings in a string can be calculated using a Triangle Number Function. With N as the length (number of characters) in a string, the number of substrings can be calculated as N*(N+1)/2. E.g. "abc" has 6 substrings: 3*(3+1)/2 = 6; a,b,c,ab,bc,abc. If we're comparing "abc" to "abcdefgh" we don't need to check if "abcd" is a substring of "abc".

Breaking "abcdefgh" (length=8) into all possible substrings requires 8*(8+1)/2 = 36 operations (vs 6 for "abc").
IF OBJECT_ID('dbo.getshortstring8k') IS NOT NULL DROP FUNCTION dbo.getshortstring8k; GO CREATE FUNCTION dbo.getshortstring8k(@s1 varchar(8000), @s2 varchar(8000)) RETURNS TABLE WITH SCHEMABINDING AS RETURN SELECT s1 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s1 ELSE @s2 END, s2 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s2 ELSE @s1 END;
4. Finding all subsrings of the shorter string that exist in the longer string:
DECLARE @s1 varchar(8000) = 'bcdabc', @s2 varchar(8000) = 'abcd'; SELECT s.s1, -- test to make sure s.s1 is the shorter of the two strings position = t.N, tokenSize = x.N, string = substring(s.s1, t.N, x.N) FROM dbo.getshortstring8k(@s1, @s2) s --<< get the shorter string CROSS JOIN dbo.tally t CROSS JOIN dbo.tally x WHERE t.N between 1 and len(s.s1) AND x.N between 1 and len(s.s1) AND len(s.s1) - t.N - (x.N-1) >= 0 AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0;
5. Retrieving ONLY the longest common substring(s)

This is the easy part. We simply Add TOP (1) WITH TIES to our SELECT statement and we're all set. Here, the longest common substring is "bc" and "xx"
DECLARE @s1 varchar(8000) = 'xxabcxx', @s2 varchar(8000) = 'bcdxx'; SELECT TOP (1) WITH TIES position = t.N, tokenSize = x.N, string = substring(s.s1, t.N, x.N) FROM dbo.getshortstring8k(@s1, @s2) s CROSS JOIN dbo.tally t CROSS JOIN dbo.tally x WHERE t.N between 1 and len(s.s1) AND x.N between 1 and len(s.s1) AND len(s.s1) - t.N - (x.N-1) >= 0 AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0 ORDER BY x.N DESC;
6. Applying this logic to your table

Using APPLY we replace my variables @s1 and @s2 with the t.str1 & t.str2. I add a filter to exclude matches that contain spaces (see my comments)... And we're off:
-- easily consumbable sample data DECLARE @yourtable TABLE (str1 varchar(8000), str2 varchar(8000)); INSERT @yourtable VALUES ('aabcdfbas','rikcdfva'),('aaab akuc','aaabir a'),('ab akuc','ab atr'); SELECT str1, str2, [max substring without spaces] = string FROM @yourtable t CROSS APPLY ( SELECT TOP (1) WITH TIES position = t.N, tokenSize = x.N, string = substring(s.s1, t.N, x.N) FROM dbo.getshortstring8k(t.str1, t.str2) s -- @s1 & @s2 replaced with str1 & str2 CROSS JOIN dbo.tally t CROSS JOIN dbo.tally x WHERE t.N between 1 and len(s.s1) AND x.N between 1 and len(s.s1) AND len(s.s1) - t.N - (x.N-1) >= 0 AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0 AND charindex(' ',substring(s.s1, t.N, x.N)) = 0 -- exclude substrings with spaces ORDER BY x.N DESC ) lcss;
Results:
str1 str2 max substring without spaces ----------- --------- ------------------------------ aabcdfbas rikcdfva cdf aaab akuc aaabir a aaab ab akuc ab atr ab
And the execution plan:

... No sorts or unnecessary operations. Just speed. For longer strings (e.g. 50 characters+) I have an even faster technique you can read about here.

这篇关于SQL中两个字符串的公共子串的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持html5模板网！

上一篇：TSQL For Filter Experice From Range multiselect 下一篇：如何从 XML 中提取值?

相关文章

将每个子标记转换为具有多个分隔符的单列-SQLConverting Every Child Tags in to a Single Column with multiple Delimiters -SQL Server (3)(将每个子标记转换为具有多个分隔符的单列-SQ

如何从多个表创建视图?How can I create a view from more than one table?(如何从多个表创建视图?)

根据前一行内的计算值创建计算值Create calculated value based on calculated value inside previous row(根据前一行内的计算值创建计算值)

如何将表格的前两列堆叠成一列，但也仅将第三How do I stack the first two columns of a table into a single column, but also pair third column with the first column only?(如何将表格的前两列堆

递归 t-sql 查询Recursive t-sql query(递归 t-sql 查询)

将月份名称转换为日期/月份编号(问题和答案的组Convert Month Name to Date / Month Number (Combinations of Questions amp; Answers)(将月份名称转换为日期/月份编号(问题和答案的组合

最新文章

创建我的自定义唯一键

获取随机记录的 ROW NUMBER

我的动态 SQL 查询有什么问题?使用 char 数据类型

转换十进制 <-->二进制与 T-SQL

如何在 SQL Server 中替换正则表达式 HTML 标签?

XML 到 SQL 的问题 - SQL Server

为什么这个程序会生成两封电子邮件?

从视图中的动态表名中选择

使用变量中的名称创建数据库

SQL Server:是否可以在不设置链接服务器的情况下从